Stylometry and Python: a pipe dream


Statistics Hacks has an interesting entry on the art and science of stylometry. Even before I started studying psychology, I was always interested in the idea that there was some method of empirically measuring the style of a person’s writing. This semester we learnt more about the techniques of principal components analysis and factor analysis, which underlie much of stylometric analysis.

So why did I mention Python? I was fooling around with a small script I found that counted the words in a document. I thought, wouldn’t it be amazing if this could be extended to perform stylometric analysis? Say the script could go through and gather up all the instances of words, the lengths of sentences, and the types of punctuation used, and feed these into the R statistical language through the RPy binding. R would then do the hard work of the statistics and churn out a result.
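To make the idea a little more concrete, here is a minimal sketch of the gathering step, using only the Python standard library. The function name `extract_features`, the regular expressions, and the choice of punctuation marks are all my own assumptions rather than anything from the script I found, and the sentence splitting is deliberately naive (it will trip over abbreviations like "e.g."). Handing the numbers over to R is not shown.

```python
import re
from collections import Counter

# Assumed definition of a "word": a run of letters, optionally with one
# internal apostrophe (so "don't" counts as a single word).
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def extract_features(text):
    """Return a few simple stylometric features for one document."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Number of words in each sentence.
    sentence_lengths = [len(WORD_RE.findall(s)) for s in sentences]

    # How often each word and each punctuation mark occurs.
    word_counts = Counter(WORD_RE.findall(text.lower()))
    punctuation = Counter(ch for ch in text if ch in ".,;:!?")

    mean_length = (
        sum(sentence_lengths) / len(sentence_lengths) if sentence_lengths else 0.0
    )
    return {
        "word_counts": word_counts,
        "sentence_lengths": sentence_lengths,
        "mean_sentence_length": mean_length,
        "punctuation": punctuation,
    }


if __name__ == "__main__":
    sample = "I came; I saw. Did I conquer? Yes!"
    for name, value in extract_features(sample).items():
        print(name, value)
```

Each document would give one such set of features, and a table of them across many documents is the kind of input that a principal components or factor analysis would start from.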

My ideas quickly flowered into more extravagant and ridiculous forms, and so I envisioned turning this script on all the writing I’ve done in the past, to find out which common factors could be uniquely identified as “me”. And then I thought, wouldn’t it be a great idea if everyone gave me their assignments and essays so I could run the analysis on ALL their work? Then I could perform multiple regression, with the factors as predictors and the mark for each assignment as the criterion. Then I could confidently state something like “Write longer and fewer sentences and statistically you will get a higher mark!”

Of course, this is all a beautiful pipe dream of mine. I only got as far as building a dictionary of all the words used, with the word as the key and the number of occurrences as the value. Elementary stuff. I’ve always been more a dreamer than a doer, sadly…
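For the record, the word-count dictionary is roughly the following. This is a reconstruction of the idea, not the original script: the function name `count_words` and the rule for what counts as a word are assumptions made for illustration.

```python
import re
from collections import Counter


def count_words(text):
    """Map each word in text to the number of times it occurs."""
    # Lowercase first so "The" and "the" are counted together.
    # Assumed word rule: letters, optionally with one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Counter is a dict subclass: word as key, occurrences as value.
    return Counter(words)


if __name__ == "__main__":
    counts = count_words("The cat saw the other cat.")
    for word, n in counts.most_common():
        print(word, n)
```

Because `Counter` is a dictionary underneath, `counts["cat"]` looks up a single word, and a word that never appears simply comes back as 0.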


    Your suggestion would work. And there are computer programs which will score essays for various standardized exams, such as those required by the GRE or SAT. They provide scores which match pretty well with those assigned by real-life human beings who actually read the essay. They use predictors such as sentence length and complexity of sentences (e.g. use of semi-colons). Thanks for plugging the book! 🙂 Bruce Frey

