Investigating Benford’s Law with Lua

20Oct08

I’ve been shifting my files around to organise them more efficiently. One trouble spot I had was with the Project Euler folder, which has a couple of small script files named problem12.py or problem3.lua and the like. I wanted to rename all of them without “problem” in their name, so I tried writing a python script to do it for me.

Unfortunately I got about as far as generating a list of files that needed to be renamed… and for some reason the script wouldn’t rename them. They don’t really need to be renamed; I just thought it would be a good exercise. I got pretty frustrated reading all the documentation and gave up for the night.

Then the next day (today) I considered trying the same thing with Lua. I was getting bored with just renaming files at that point, so I was not terribly motivated to make it work. And that’s when I started reading the Statistics Hacks book I borrowed out from the library, and recalled Benford’s Law.

In brief, it states that the first nonzero digit in lists of numbers from real-life sources follows a predictable, downward distribution. The number 1 occurs roughly a third of the time, 2 about 0.17, and decreasing up to 9. Read the wiki article if you like; it explains it so much better.

The Statistics Hacks book mentioned an example where the first nonzero digit for file sizes was obtained for a Apple Powerbook. I wanted to achieve the same thing by writing a Lua script.

I took the excellent example script from the Lua filesystem homepage and proceeded to modify it.

#! /usr/bin/env lua
--Investigating Benford's Law for file sizes
--To change target folder, look at benford()

require "lfs"
require "os"

disttable = {}
for x = 1,9 do
     disttable[x] = 0
end
total = 0


function benford (path)
    for file in lfs.dir(path) do
        if file ~= "." and file ~= ".." then
            local f = path .. '/' .. file
            local attr = lfs.attributes (f)
            if type(attr) == "table" then
                if attr.mode == "directory" then
                    benford (f)
                elseif attr.mode == "file" then
                    local singval = (string.sub(attr.size, 0, 1))
                    total = total + 1 
                    if singval == "1" then
                        disttable[1] = disttable[1] + 1
                    elseif singval == "2" then
                        disttable[2] = disttable[2] + 1
                    elseif singval == "3" then
                        disttable[3] = disttable[3] + 1
                    elseif singval == "4" then
                        disttable[4] = disttable[4] + 1
                    elseif singval == "5" then
                        disttable[5] = disttable[5] + 1
                    elseif singval == "6" then
                        disttable[6] = disttable[6] + 1
                    elseif singval == "7" then
                        disttable[7] = disttable[7] + 1
                    elseif singval == "8" then
                        disttable[8] = disttable[8] + 1
                    elseif singval == "9" then
                        disttable[9] = disttable[9] + 1
                    end
                end
            end
        end
    end
end


benford("/home/colin")

for i,v in ipairs(disttable) do
    print ("Probability of " .. i .. ": " .. v/total)
end

In this particular case, running the script prints out the probabilities of each digit in the first digit of each file size for the folder /home/colin. It is recursive and so will count every file in every folder in it. The long bit with singvalue and disttable is ugly, I know, but I don’t know how to get Lua to recognise the number argument as a key as well as a value.

The output:

Probability of 1: 0.361879031184
Probability of 2: 0.13270800433798
Probability of 3: 0.1128826651953
Probability of 4: 0.078235887288571
Probability of 5: 0.068037824159516
Probability of 6: 0.062329953005194
Probability of 7: 0.065164862345174
Probability of 8: 0.059761410985749
Probability of 9: 0.057230921440667

I ran the script for various locations, such as /usr and /var/cache/pacman. I also tried running the script for the whole file system, /, but after four hours of mindless computation I concluded that it was either not worth the effort or something was wrong with the script.

As you can see, the probability decreases with higher digits. The more data you have, the better it conforms to Benford’s Law. Having a very skewed sample (eg. only including your folder of holiday snapshots) will not produce the characteristic distribution. I didn’t expect my pacman cache to follow the distribution, it being a folder containing only compressed files, but it did.

I find it an impressive and rather startling law. Apparently it shows up in any data obtained from counting something, and isn’t affected by multiplying the numbers or changing their bases. I’m also very impressed with the script I wrote. Go little Lua script go! 🙂

Advertisements


%d bloggers like this: