*echoes the title*

Back in the day, like a lot of other people who were just venturing out into the cold exterior of the Internet, I made a Geocities page. I attended to it meticulously like a Zen gardener until one day I blew up and didn’t look at it again. And I was amazed, years later, to discover it was still around.

I am experiencing that same sort of feeling right now, though it’s on a smaller scale!

I note that it has been almost a full year since my last post to this blog. A lot has happened since then. I got fed up with Linux and reinstalled XP. Don’t shoot me; I’d been trying to get wireless working on this laptop for a very long time with no results, and there were a few Windows-only programs I was interested in trying out. Linux is a good idea, it’ll get bigger, but I’d rather stick with XP for now. My family members have enough trouble comprehending how to navigate their way through Windows, and I would like them to be able to use this laptop if required.

(I was going to put “if they need to”, but the ancient rule of “no sentence ending with a preposition” surfaced in my mind. I would very much like to exorcise this grammar-daemon that lives in my synapses.)

I made a Wikidot account with the goal of collecting useful information in a central location, but the flexibility of Wikidot was starting to wear on my nerves. You can have a wiki, a blog, a forum, an issue tracker, and lots of other things. You can modify any of the provided templates as much as you want. Nothing seems to be forbidden; you can even make a little moolah with Google Adsense.

Unfortunately I think their system is a bit too rigorous for my needs; in the end, a simple WordPress blog is best for meeeeee (somebody make a musical out of that), especially one which has such good support for pasting in source code. Wikidot probably has this too, or you can stick it in yourself somewhere; I don’t know. I didn’t check hard enough. I didn’t really feel like making the effort. For me, using Wikidot to host my mostly redundant tidbits was like using a warship to transport a crate of oranges to goodness-knows-whereland.

So I am here again. This time I won’t be so pointless, I swear!

:) -> :( -> :|

14Dec08

I’m feeling a little disappointed.

I’ve been trying to write a script which requires extracting a gzipped tar archive, using something like tarfile in the Python standard library but in Ruby instead. I first looked at zlib, but it has no easy way to extract a bunch of files to a directory. It seems to be more suited for compressing small amounts of data to be passed between servers… or so that’s what I’ve read about it. Its documentation is also fairly atrocious.

I then remembered Arson and went to have a look at how it solved the problem of gzip extraction. I found out that it used the Archive::Tar::Minitar library that is found in the Facets library. It seemed to do exactly what I wanted. However, I was not happy that the standard library could not do this already. Ruby and Python are constantly compared against each other and it would not serve Ruby well to be lacking in its standard libraries.

So I went about my happy way… until… until I found out that Minitar cannot extract symbolic links.

As per the common lingo, this was a major showstopper. This script absolutely requires that the extraction pick up any symbolic links. I googled for any information about this problem and found that someone had picked up a patch to support this two years ago.

This makes me sad. I want to like Ruby. I like how it is designed. It fits the way I think about problems. Python is good too, but there are things I do not fancy about it. Sometimes I find it confusing. My dislike of Python and fondness for Ruby are partly irrational, I realize, but I still would like to continue using Ruby. If someone was paying me to do this, I might be using something else, but I aim to have fun with programming. I come nowhere close to being a professional programmer; I just enjoy fiddling with code to see what it can do for me.

UPDATE: I’ve realised that simply using the tar command from Ruby through backticks is better for my purposes. Still, I wish there was a Ruby equivalent to tarfile.
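
For the curious, here is roughly what shelling out to tar can look like. It is only a sketch: the method name extract_tarball is made up for illustration, and it assumes a tar binary on the PATH that accepts -x, -z, -f and -C. It uses the multi-argument form of system rather than backticks, so the shell never sees the file names.

```ruby
require "fileutils"

# Extract a gzipped tarball into dest by calling the system tar.
# tar restores symbolic links itself, which is the whole point here.
def extract_tarball(archive, dest)
  FileUtils.mkdir_p(dest)
  # Passing separate arguments bypasses the shell, so spaces in paths are safe.
  ok = system("tar", "-xzf", archive, "-C", dest)
  raise "tar failed for #{archive}" unless ok
  dest
end

# Example (hypothetical paths):
# extract_tarball("backup.tar.gz", "restored")
```

The obvious downside is that the script now depends on tar being installed, which is fine on my Linux box but would not travel well.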


I spent most of yesterday and part of the day before writing a screen scraper for Wikipedia with the Hpricot library. I just wrote it to learn Hpricot, as it seemed an interesting and useful library to know — and now, having completed the script, I am a bit stumped as to how to apply it to something bigger. I do not foresee myself doing an awful lot of primitive screen scraping of this exact nature in the future. I am still thinking of a worthy project to devote myself to!

Here it is:

#!/usr/bin/env ruby
# Wikiscraper
# written by mentallaxative
#

require 'rubygems'
require 'hpricot'
require 'open-uri'

class Wikiscraper

  def initialize(arguments)
    #ARGV needs to be cleared else it ends up in 'gets'
    @keyword = arguments.join(" ")
    substitutions = {' ' => '+', '(' => '%28', ')' => '%29'}
    substitutions.each_pair {|a,b| @keyword.gsub!(a, b) }
    arguments.clear
    
    @search_links = []
  end

  def main
    process_search
    populate_arrays
    get_selection
    final_page_processing
  end

  #open our search
  def process_search
    search_uri = "http://en.wikipedia.org/wiki/Special:Search?search=#{@keyword}&fulltext=Search"
    @s_page = Hpricot(open(search_uri))
  end

  def populate_arrays
    #results_data contains all search result names
    results_data = (@s_page/"ul.mw-search-results/li/:not(div.mw-search-result-data)").inner_text

    #search_links is an array of links of search results
    (@s_page/"ul.mw-search-results"/:a).each do |ah|
      #sub (not sub!) so that a link without '/wiki/' is kept instead of becoming nil
      @search_links << ah.attributes['href'].sub('/wiki/', 'http://en.wikipedia.org/wiki/')
    end

    s_entries = []
    results_data.each_line {|x| s_entries << "#{x}\r"}
    #print out the search results
    s_entries.each_index {|x| print "#{x+1}: #{s_entries[x]}"}
  end

  def get_selection
    toc_string = '<table class="toc" id="toc" summary="Contents">'

    puts "\nType in a entry number:"
    #chomp (not chomp!) because chomp! returns nil when there is nothing to strip
    @number = gets.chomp.to_i - 1
    selection = Hpricot(open(@search_links[@number]))

    #gets rid of everything from the table of contents onwards
    toc_index = selection.to_s.index(toc_string)
    if toc_index.nil?
      @wiki_results = selection
    else
      @wiki_results = Hpricot(selection.to_s[0...toc_index])
    end
  end

  def final_page_processing
    (@wiki_results/"table.infobox").remove
    @text_body = (@wiki_results/"#bodyContent/p/:not(#coordinates)")

    no_article_found = (@wiki_results/"div.noarticletext")

    if not no_article_found.empty?
      puts no_article_found.inner_text
    else
      puts((@wiki_results/"div.dablink").inner_text) #disambig info
      puts "\n"
      #get rid of citation marks like [1], [3], [12]
      @text_body.inner_text.gsub(/\[[\w\d]{0,3}\]/, "").each_line {|c| puts c}
    end
  end

end

if ARGV.empty?
  puts "Please supply a subject to search Wikipedia for."
else
  scrape = Wikiscraper.new(ARGV)
  scrape.main
end

The script takes all of its command line arguments and joins them into one keyword to search. It then prints a numbered list of search results and asks you to type in the number of the entry you would like to see. Here is an example of it being used:


[colin@workbench wikiscrape]$ ./wikiscrape albert einstein
1: Albert Einstein
2: Albert einstein
3: Albert Einstein College of Medicine
4: Albert Einstein Medal
5: Albert Einstein Award
6: Hans Albert Einstein
7: Albert Einstein Memorial
8: Albert Einstein World Award of Science
9: Albert Einstein's brain
10: General relativity (redirect EINSTEINS THEORY OF GRAVITATION)
11: Albert Einstein Society
12: Albert Einstein High School
13: Albert Einstein Institution
14: Albert Einstein Peace Prize
15: Albert Einstein House
16: Albert Einstein Medical Center
17: List of scientific publications by Albert Einstein
18: Max Planck Institute for Gravitational Physics (redirect Albert Einstein Institute)
19: Albert Einstein in popular culture
20: Albert Einstein: The Practical Bohemian
Type in a entry number:
1
"Einstein" redirects here. For other uses, see Einstein (disambiguation).
Albert Einstein (German: IPA: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n] (Audio file) (help·info); English: IPA: /ˈælbɝt (-ət) ˈaɪnstaɪn/) (14 March 1879 – 18 April 1955) was a German-born theoretical physicist. He is best known for his theory of relativity and specifically mass–energy equivalence, expressed by the equation E = mc2. Einstein received the 1921 Nobel Prize in Physics "for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect."Einstein's many contributions to physics include his special theory of relativity, which reconciled mechanics with electromagnetism, and his general theory of relativity, which was intended to extend the principle of relativity to non-uniform motion and to provide a new theory of gravitation. His other contributions include advances in the fields of relativistic cosmology, capillary action, critical opalescence, classical problems of statistical mechanics and their application to quantum theory, an explanation of the Brownian movement of molecules, atomic transition probabilities, the quantum theory of a monatomic gas, thermal properties of light with low radiation density (which laid the foundation for the photon theory), a theory of radiation including stimulated emission, the conception of a unified field theory, and the geometrization of physics.Einstein published over 300 scientific works and over 150 non-scientific works. In 1999 Time magazine named him the "Person of the Century". In wider culture the name "Einstein" has become synonymous with genius, and he has since been regarded as one of the most influential people in human history.
[colin@workbench wikiscrape]$

This page is relatively narrow so it looks messier than it does in the terminal. I removed some empty lines which were interfering with the code tags, so it is less crammed in as well. The script only prints out the overview of the article, which is enough for me as I find that it is often an adequate summary of a topic.
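
One thing I would probably do differently: the script escapes spaces and brackets in the keyword by hand, which misses everything else (ampersands, quotes, and so on). A sketch of the same step using URI.encode_www_form_component from the standard library is below. The method names search_keyword and search_uri are made up for illustration, and this needs a newer Ruby than the one the script above was written for.

```ruby
require "uri"

# Join the command-line words into one query value and let the
# standard library handle the escaping.
def search_keyword(words)
  URI.encode_www_form_component(words.join(" "))
end

# Hypothetical helper: builds the same search URL the script uses.
def search_uri(words)
  "http://en.wikipedia.org/wiki/Special:Search?search=#{search_keyword(words)}&fulltext=Search"
end

# Example:
# puts search_uri(%w[albert einstein])
```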


…and so didst the brave knyght of St. Archibald, riding with his pennants of gold, meet his ende on the typ of the lance belonging to that of the nefarious penguin mercenary, for the knyght had spent many an hour recompiling his helmet from scratch in the great forges, to reinforce it against blade and muscle, such that said knyght had no recollection of the tactics of warfare that led to his defeat…

And thus ends my strange attempt at fusing medieval literature with Linux.

I’ve recompiled many, many packages in Arch to test out the notion that doing so does little to benefit the system. That notion is mostly true. I decided to try out the Intel C Compiler (icc) for this since I’ve heard it can give better performance. I have compiled gtk2, libx11, Ruby and Pango and various other applications and there isn’t any speed difference you can detect without a stopwatch. Python became slightly faster at one of my Project Euler solutions, but nothing that really makes me go “oooh, I HAVE to recompile this next time”.

As for xulrunner, I sliced off a lot of its extra settings and there is little difference. Here is the mozconfig file I used:

. $topsrcdir/xulrunner/config/mozconfig
ac_add_options --prefix=/usr
ac_add_options --libdir=/usr/lib
ac_add_options --with-system-nspr
ac_add_options --with-system-nss
ac_add_options --with-system-jpeg
ac_add_options --with-system-zlib
ac_add_options --with-system-bz2
ac_add_options --with-system-png
ac_add_options --enable-system-lcms
ac_add_options --disable-system-sqlite
ac_add_options --enable-system-cairo
ac_add_options --with-pthreads
ac_add_options --enable-strip
ac_add_options --disable-tests
ac_add_options --disable-mochitest
ac_add_options --disable-installer
ac_add_options --disable-debug
ac_add_options --enable-optimize="-march=native -mtune=native -Os -pipe -fomit-frame-pointer"
ac_add_options --enable-default-toolkit=cairo-gtk2
ac_add_options --enable-pango
ac_add_options --enable-svg
ac_add_options --enable-canvas
ac_add_options --disable-javaxpcom
ac_add_options --disable-crashreporter
ac_add_options --enable-safe-browsing
ac_add_options --enable-startup-notification
ac_add_options --enable-profile-guided-optimization
ac_add_options --disable-accessibility
ac_add_options --disable-system-hunspell
ac_add_options --disable-updater
ac_add_options --disable-postscript
ac_add_options --disable-gnomeui
ac_add_options --disable-gnomevfs
ac_add_options --disable-printing
ac_add_options --disable-composer
ac_add_options --disable-dbus
ac_add_options --disable-oji
ac_add_options --disable-vista-sdk-requirements
ac_add_options --disable-parental-controls
ac_add_options --enable-plaintext-editor-only
ac_add_options --disable-jsd
ac_add_options --disable-logging
ac_add_options --disable-mailnews
ac_add_options --disable-ldap

export BUILD_OFFICIAL=1
export MOZILLA_OFFICIAL=1
mk_add_options BUILD_OFFICIAL=1
mk_add_options MOZILLA_OFFICIAL=1

My latest recompiling obsession is the kernel. I have compiled nine of them so far, with varying levels of success! Still not a huge difference. Recompiling Gentoo-style is mostly done for fun and should not be attempted with the expectation that your system will run 10x faster than without.

In other news, I checked out the current progress of Slitaz on VirtualBox on my Windows machine. It has over 1100 packages in its repositories now–quite a significant number for such a tiny distro! Unfortunately I don’t think I am in a position to install Slitaz on top of Arch. My current Arch installation has built up such a patina of… stuff, that it would be hard to replicate everything I wanted in Slitaz.

Slitaz is still a great distro though. It’s the only one I have had a crack at customising. I shall be very pleased when it releases version 2.0 sometime next year.


I consider the errant arrival of individuals to this blog by way of Google searches like “pygtk error” and “vimperator slow” to be roughly analogous to dolphins being ensnared in a trawler’s net, to be canned along with tuna for the consumption of hungry and dolphin-loving children.

Ok, poor analogy. But it sounded funny in my head.

I’ve been dabbling more in Ruby, probably over the limit of what is considered healthy. More specifically, I rewrote fui in Ruby, just to see how Ruby solves its problems in a Rubyish way. I have named it sel, as in select, with the ect gone on a walk. I think I finally understand how option parsing works, though the structure of it that exists in my head is a very rickety wooden one, prone to disintegration by the winds of confusion.

When I first became interested in programming, I, like a lot of other people, was far too enamoured with the speed of languages. How fast is this one? How slow is that one? The fast ones are unquestionably better ones, thought I, the highly prejudiced, completely blank-slate I. Fast is almost a synonym for good in this blighted modern society–all lanes of a road are the right lane these days. Given my laptop is not a very powerful or new one (1.73GHz, 256MB RAM), I suppose I can be forgiven for my initial obsession over speed. You guys with dual-core systems and 2GB+ of RAM–I cast an evil eye upon your hapless selves.

But since then, my thinking has changed. After throwing myself at C and Haskell and Python and Ruby, I can say that I prefer higher-level interpreted languages–great for complete newbies like me, just load up the interpreter and have fun. A program/script written in Python or Ruby is only initially slow, when it has to load the interpreter into memory–after that the speed is actually quite acceptable to me.

My implementation of sel, the Ruby equivalent to fui, is a bit slower than its Python cousin. I say this very guardedly because the scripts are so trivial that benchmarking them is, well, trivial. I found a speed difference of around 0.1-0.2 seconds with selecting, which isn’t the huge performance hit I had been dreading with Ruby before stepping into it.
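
Timings this small are noisy, so a single run does not say much. A minimal sketch of averaging several runs with the Benchmark module from the standard library is below. The method name average_seconds is made up for illustration, and this is not how I produced the numbers above.

```ruby
require "benchmark"

# Run the given block several times and return the mean
# wall-clock time in seconds.
def average_seconds(runs = 5)
  total = 0.0
  runs.times { total += Benchmark.realtime { yield } }
  total / runs
end

# Example: time five cold starts of the Ruby interpreter,
# which is where most of the cost of a tiny script goes.
# puts average_seconds { system("ruby", "-e", "nil") }
```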

Apart from that I also have to wean myself off the tendency to make my code as terse as possible. Sel is currently at around 180 lines of code, and fui is nearly at 300. Sel would be at 200 if I weren’t so zealous. If I go any further I risk obfuscating the code (what fun words one learns when one soaks oneself in a different atmosphere online; obfuscation, twitter, api, unit tests, profiling galore!), and I do not intend on reinventing a modern dialect of hieroglyphics.

Another issue worthy of comment: github. I have split myself between mercurial and git, bitbucket and github. In which lot shall I plant my lowly allegiances? Github is absolutely crawling with Ruby. Bitbucket is less popular, it seems. In any case, I don’t think I will be able to remove either of their respective version control systems from my laptop any time soon. Decisions, decisions….

Oh, by the way, for anyone who is starting to learn Ruby, a good thing to install is fastri. The default ri searches through Ruby documentation very, very slowly and very, very painfully compared to fastri. It’s definitely worth the disk space.

Today’s blog post has been brought to you by a boring/bored person with nothing better to do… see you next time!


My wisdom tooth has stopped giving me grief, and I am pretty happy with myself after successfully writing a tiny script to do something I’ve been meaning to do for a while. I attempted this in Python previously and couldn’t get it to work, which makes Ruby quite impressive to me given I only picked it up recently. For those not familiar with .pacsave files, they are configuration files saved by the Arch Linux package manager when the package associated with them is uninstalled.

#!/usr/bin/env ruby
#
# Finds all .pacsave files and deletes them!

require "fileutils"

$etc_location = "/etc/"


def recurse_check(folder)
  listfiles = File.join(folder, "*") 

  #glob gets all files which match the criteria passed in the argument--in this case, all files in folder
  Dir.glob(listfiles).each do |x|
    if File.file?(x) and File.basename(x).to_s.end_with?(".pacsave") #is it a file and does it end with .pacsave?
      puts "File: #{x}"
      FileUtils.rm(x) #removes .pacsave files! Be careful with this!
    elsif File.directory?(x)
      #puts "Directory: #{x}"
      recurse_check(x)
    end
  end

end


recurse_check($etc_location)
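
The hand-written recursion can also be left to Dir.glob, whose "**" pattern descends into subdirectories. The sketch below is a variation on the script above, not a replacement for it: the method names and the dry_run flag are made up for illustration, and it only lists files unless told otherwise.

```ruby
require "fileutils"

# Return every *.pacsave file under root, searching recursively.
def find_pacsave(root)
  Dir.glob(File.join(root, "**", "*.pacsave")).select { |path| File.file?(path) }
end

# List the files, and delete them only when dry_run is false.
def clean_pacsave(root, dry_run = true)
  find_pacsave(root).each do |path|
    puts "File: #{path}"
    FileUtils.rm(path) unless dry_run
  end
end

# Example:
# clean_pacsave("/etc")         # just shows what would go
# clean_pacsave("/etc", false)  # actually deletes; be careful!
```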

I was a bit surprised to find out that Nautilus is unable to undo its cut/copy/paste actions. It gave me an idea for implementing an undo command in fui. I got a bit bored with hacking fui, probably because my ideas for improving it had dried up. So far, I have added a logging function where fui keeps a temporary file recording all its copying and moving actions. The next step is to properly write an undo function to delete copied files or move files back to whence they came. Oh, and I did forget about exception handling with regard to duplicate files–currently, fui will just fail if you try to copy a file to a directory with a file of the same name. Perhaps I should include an option for replace/skip?
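
To make the undo idea concrete, here is one way the log and the undo step could fit together. It is a sketch in Ruby, not fui's actual code (fui is a Python script): the method names and the tab-separated log format are assumptions made for illustration, and file names containing tabs would break it.

```ruby
require "fileutils"

# Assumed log format: one action per line,
# "copy<TAB>source<TAB>destination" or "move<TAB>source<TAB>destination".
def log_action(log_path, action, source, destination)
  File.open(log_path, "a") { |f| f.puts([action, source, destination].join("\t")) }
end

# Undo the logged actions, newest first, then empty the log.
def undo_all(log_path)
  File.readlines(log_path).reverse_each do |line|
    action, source, destination = line.chomp.split("\t", 3)
    case action
    when "copy" then FileUtils.rm(destination)          # delete the copy
    when "move" then FileUtils.mv(destination, source)  # move the file back
    end
  end
  File.write(log_path, "")
end
```

Undoing in reverse order matters: if a file was copied and then the original was moved away, the move has to be reversed before the copy is removed, otherwise the log no longer describes what is on disk.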

Before I became interested in Ruby (more on that later) I had been playing around with the ncurses module in Python. Uselesspython.com has a good starter script which I played around with a bit. I had a look in this area a few months ago, except I was reading this old tutorial which is rather more difficult for one’s mind to digest. I could write something ncurses-based for fui, but I don’t really see what can be done for it. It is the Fake User Interface, after all. 😉 If I keep adding more to it, it will end up as just another ncurses file manager, which, although nice, is probably not the most useful of applications to write when plenty of good alternatives exist.

So, what happened to Ruby… well, I think I’m just lazy–too lazy to learn something new. I read a lot more tutorials and the concepts in Ruby seem rather neat to me but I miss the familiarity I have with Python. I tried to write a function to detect primes in Ruby, but I don’t think I’ve fully grasped the implications of an object-oriented approach…. I avoid it whenever I can in Python.

Having said that, I’m happy to say that the difference in speed between Ruby and Python is not so great as to make me shy away from Ruby. It’s mostly the same, really.