The ancient art of… screen-scraping

12Dec08

I spent most of yesterday and part of the day before writing a screen scraper for Wikipedia with the Hpricot library. I wrote it purely to learn Hpricot, which seemed an interesting and useful library to know. Now that the script is finished, I am a bit stumped as to how to apply it to something bigger, since I do not foresee myself doing an awful lot of primitive screen scraping of this exact nature in the future. I am still thinking of a worthy project to devote myself to!

Here it is:

#!/usr/bin/env ruby
# Wikiscraper
# written by mentallaxative
#

require 'rubygems'
require 'hpricot'
require 'open-uri'

class Wikiscraper

  def initialize(arguments)
    @keyword = arguments.join(" ")
    substitutions = {' ' => '+', '(' => '%28', ')' => '%29'}
    substitutions.each_pair {|a,b| @keyword.gsub!(a, b) }
    #ARGV needs to be cleared, else 'gets' treats the leftover arguments as file names to read from
    arguments.clear
    
    @search_links = []
  end

  def main
    process_search
    populate_arrays
    get_selection
    final_page_processing
  end

  #open our search
  def process_search
    search_uri = "http://en.wikipedia.org/wiki/Special:Search?search=#{@keyword}&fulltext=Search"
    @s_page = Hpricot(open(search_uri))
  end

  def populate_arrays
    #results_data contains all search result names
    results_data = (@s_page/"ul.mw-search-results/li/:not(div.mw-search-result-data)").inner_text

    #search_links is an array of links of search results
    (@s_page/"ul.mw-search-results"/:a).each do |ah|
      #sub rather than sub!, since sub! returns nil when nothing is replaced
      @search_links << ah.attributes['href'].sub('/wiki/', 'http://en.wikipedia.org/wiki/')
    end

    s_entries = []
    results_data.each_line {|x| s_entries << "#{x}\r"}
    #print out the search results
    s_entries.each_index {|x| print "#{x+1}: #{s_entries[x]}"}
  end

  def get_selection
    toc_string = '<table class="toc" id="toc" summary="Contents">'

    puts "\nType in a entry number:"
    #chomp rather than chomp!, since chomp! returns nil when there is no newline to remove
    @number = gets.chomp.to_i - 1
    selection = Hpricot(open(@search_links[@number]))

    #gets rid of everything after the table of contents
    html = selection.to_s
    toc_index = html.index(toc_string)
    if toc_index.nil?
      @wiki_results = selection
    else
      #start at 0 so the first character is kept, and stop just before the TOC
      @wiki_results = Hpricot(html[0...toc_index])
    end
  end

  def final_page_processing
    (@wiki_results/"table.infobox").remove
    @text_body = (@wiki_results/"#bodyContent/p/:not(#coordinates)")

    no_article_found = (@wiki_results/"div.noarticletext")

    if not no_article_found.empty?
      puts no_article_found.inner_text
    else
      puts((@wiki_results/"div.dablink").inner_text) #disambig info
      puts "\n"
      #get rid of citation marks like [1], [3], [12]
      #each_line rather than each, since String#each only exists in Ruby 1.8
      @text_body.inner_text.gsub(/\[\w{0,3}\]/, "").each_line {|c| puts c}
    end
  end

end

if ARGV.empty?
  puts "Please supply a subject to search Wikipedia for."
else
  scrape = Wikiscraper.new(ARGV)
  scrape.main
end
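
As a side note, the hand-written substitutions in initialize only cover spaces and parentheses. The standard uri library can probably do the same job for any character. The following is a minimal sketch rather than part of the script above: search_uri_for is a made-up helper name, and it assumes a Ruby whose uri library provides URI.encode_www_form_component.

```ruby
require 'uri'

# Hypothetical helper (not part of the original script): join the
# command-line words into one keyword and let the standard library
# escape it for use in the search URL.
def search_uri_for(words)
  keyword = URI.encode_www_form_component(words.join(' '))
  "http://en.wikipedia.org/wiki/Special:Search?search=#{keyword}&fulltext=Search"
end

puts search_uri_for(%w[albert einstein])
puts search_uri_for(['mercury', '(planet)'])
```

Spaces still become + and parentheses still become %28 and %29, so the second call should contain mercury+%28planet%29, but characters such as & are escaped as well.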

The script takes all of its command line arguments and joins them into one keyword to search for. It then prints a numbered list of search results and asks for the number of the entry you would like to see. Here is an example of it being used:


[colin@workbench wikiscrape]$ ./wikiscrape albert einstein
1: Albert Einstein
2: Albert einstein
3: Albert Einstein College of Medicine
4: Albert Einstein Medal
5: Albert Einstein Award
6: Hans Albert Einstein
7: Albert Einstein Memorial
8: Albert Einstein World Award of Science
9: Albert Einstein's brain
10: General relativity (redirect EINSTEINS THEORY OF GRAVITATION)
11: Albert Einstein Society
12: Albert Einstein High School
13: Albert Einstein Institution
14: Albert Einstein Peace Prize
15: Albert Einstein House
16: Albert Einstein Medical Center
17: List of scientific publications by Albert Einstein
18: Max Planck Institute for Gravitational Physics (redirect Albert Einstein Institute)
19: Albert Einstein in popular culture
20: Albert Einstein: The Practical Bohemian
Type in a entry number:
1
"Einstein" redirects here. For other uses, see Einstein (disambiguation).
Albert Einstein (German: IPA: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n] (Audio file) (help·info); English: IPA: /ˈælbɝt (-ət) ˈaɪnstaɪn/) (14 March 1879 – 18 April 1955) was a German-born theoretical physicist. He is best known for his theory of relativity and specifically mass–energy equivalence, expressed by the equation E = mc2. Einstein received the 1921 Nobel Prize in Physics "for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect."Einstein's many contributions to physics include his special theory of relativity, which reconciled mechanics with electromagnetism, and his general theory of relativity, which was intended to extend the principle of relativity to non-uniform motion and to provide a new theory of gravitation. His other contributions include advances in the fields of relativistic cosmology, capillary action, critical opalescence, classical problems of statistical mechanics and their application to quantum theory, an explanation of the Brownian movement of molecules, atomic transition probabilities, the quantum theory of a monatomic gas, thermal properties of light with low radiation density (which laid the foundation for the photon theory), a theory of radiation including stimulated emission, the conception of a unified field theory, and the geometrization of physics.Einstein published over 300 scientific works and over 150 non-scientific works. In 1999 Time magazine named him the "Person of the Century". In wider culture the name "Einstein" has become synonymous with genius, and he has since been regarded as one of the most influential people in human history.
[colin@workbench wikiscrape]$

This page is relatively narrow, so the output looks messier here than it does in the terminal. I also removed some empty lines that were interfering with the code tags, so the real output is less cramped than it appears above. The script prints only the overview of the article, which is enough for me, as I find it is often an adequate summary of a topic.
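
The "overview only" behaviour comes down to two small string operations: cutting the page off at the table of contents, and stripping the citation marks. Here is a rough sketch of both on a made-up HTML snippet, using plain strings so that it runs without Hpricot or a network connection. The names overview_html and strip_citations are hypothetical, and the TOC marker is shortened from the one used in the script.

```ruby
# Hypothetical stand-ins for two steps of the script, working on plain
# strings instead of Hpricot documents.
TOC_MARKER = '<table class="toc"'

# Keep only what comes before the table of contents, if there is one.
def overview_html(html)
  cut = html.index(TOC_MARKER)
  cut ? html[0...cut] : html
end

# Remove citation marks such as [1], [3] or [12].
def strip_citations(text)
  text.gsub(/\[\w{0,3}\]/, '')
end

# Made-up sample page: an intro paragraph, a TOC table, then a section.
sample = '<p>Intro text.[1][12]</p>' \
         '<table class="toc" id="toc"><tr><td>Contents</td></tr></table>' \
         '<p>Early life</p>'

puts strip_citations(overview_html(sample))
```

A page without a table of contents passes through overview_html unchanged, which mirrors the nil check in get_selection.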
