Typos
There is nothing worse than when you spend hours, days (weeks?) writing something and then you revisit it and find it peppered with dumb, and seemingly inescapable, typos.
![](https://i0.wp.com/www.businessballs.com/images/paris_spring_puzzle.jpg?resize=384%2C312)
This is one (a rare one). Right?
It really bugs me:
- If I see those typos in somebody else’s work I automatically think, “sloppy”; and
- No matter how hard I try, my brain just doesn’t see them when I am checking my own words. My brain reads what I meant to write, not what I actually wrote. Perhaps I should be more forgiving when I read other people’s stuff?
Sure, there are tools out there to help. Spell checkers and Grammarly do their best, but still, things slip through. I am sure there are more options, and at some stage I will have a look, but at the moment I am kind of swept up in doing it my way 😉
Here is what I have done:
1. I downloaded all the posts from this blog (there are 108 excluding this one, and including lots of half-baked drafts I never published). You can export that from WordPress as a WXR file.
2. The WXR format (XML) is not so well documented. There is a bit on it here, and I found this Python Gist that was the bare bones of what I wanted: it extracts all (my) written content from the WXR file and strips out the metadata. Unfortunately, it’s in Python (not my strongest, but I did a bit at uni so I know the workings), so I had to dust off that cobwebby corridor. I considered redoing it in Java, but the XML tools in Java were woeful compared to Python’s stock ElementTree.
3. The Gist only went halfway to what I was after, though. I just want prose – no headings, WordPress shortcodes and so on. Also, for my processing, I needed the sentences arranged one per line, all in one file. This Gist here is the outcome. It takes the file and creates a ‘pure’ text file for the next stage. The output was a file of 1968 sentences/lines, all written by me (32200 words) – my writing style in a nutshell (I hypothesise!).
4. A while ago I played with the Stanford Parser, a natural language processor. As the only NLP option I knew of (then), I went back to Java and wrote something that parses each sentence in my text file from (3), tokenizes it and works out each word’s “Part-of-Speech” (POS). The Stanford Parser does more than that: it forms text into a tree structure that describes the relationships between words, clauses, sentences and so on. The problem I found was that the Parser spits the dummy with large text files of ‘prose’. This kind of makes sense, as you are asking the algorithm to essentially make a tree of (and so make ‘sense’ of) the whole body of text. The Stanford Parser also takes its time… I used multithreading to speed things up, but it still takes 5 minutes to chomp through my 1968 sentences.
5. So I end up with a little tree (shrub?) for each sentence, but the only thing I use is the POS assigned to each word. I go through the trees/sentences and create a transition matrix: for example, every time a noun follows a verb, the corresponding cell in the 36 × 36 matrix gets incremented by one. I then form a probability transition matrix from this by dividing each cell value by the total number of transitions. This is all in Java. I plonked the output into Google Sheets.
6. Just to visualise the outcome, I took to R and made the heatmap below. The heatmap, admittedly, isn’t in itself that insightful, but I am pretty pleased with it, mainly because it means I have picked up my programming again.
![](https://i0.wp.com/floatingintheclouds.com/wp_blog/wp-content/uploads/2017/08/h.png?resize=900%2C900&ssl=1)
You will have to google what a ‘model’ POS is (etc., etc.). It is certainly no grammar I know from my Latin. To read the heatmap: the darker a cell, the more often the row’s POS is followed by the column’s POS when I write, from one word to the next.
Incidentally, it seems there are POS combos I never use, and one POS I NEVER use at all (hence the weird formatting glitch in its column on the far right): the possessive wh-pronoun, which I think is just an overly technical term for ‘whose’ (quite why ‘to’ just got to be ‘to’ and not dative-or-whatever). I never use the word ‘whose’ when I write. That is interesting, right? Except now, of course, I have used it twice.
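For the curious, pulling the written content out of a WXR export is pleasantly short with ElementTree. This is a minimal sketch of the idea, not the actual Gist I used; the `extract_posts` name is mine, and it assumes the standard `content:encoded` namespace that WordPress exports use:

```python
import xml.etree.ElementTree as ET

# WXR wraps each post body in <content:encoded> under an <item>.
# This namespace URI is the standard one WordPress emits.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

def extract_posts(wxr_path):
    """Yield the raw body text of every item in a WXR export."""
    tree = ET.parse(wxr_path)
    for item in tree.getroot().iter("item"):
        body = item.find("content:encoded", NS)
        if body is not None and body.text:
            yield body.text
```

Stripping the remaining metadata, shortcodes and HTML from each body is the fiddly part the Gist handles.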
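Arranging the prose one sentence per line can be done with a naive regex split. A sketch only (the function name is mine, and real sentence splitting needs more care – abbreviations like “e.g.” will trip this up):

```python
import re

def to_sentence_lines(text):
    """Split prose into sentences, one per line, dropping blanks.

    Naive: splits wherever ., ! or ? is followed by whitespace.
    """
    flat = " ".join(text.split())            # collapse all whitespace
    parts = re.split(r"(?<=[.!?])\s+", flat)  # split after terminators
    return "\n".join(p for p in parts if p)
```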
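The multithreading over sentences was done in Java against the Stanford Parser, but the shape of it is the same in any language: farm independent sentences out to a worker pool and collect the tagged results in order. A sketch with a stand-in `tag_sentence` (in the real thing, that call is the slow Stanford Parser invocation):

```python
from concurrent.futures import ThreadPoolExecutor

def tag_sentence(sentence):
    # Stand-in for the real (slow) parser call: crudely labels
    # every token as a noun, just to show the plumbing.
    return [(tok, "NN") for tok in sentence.split()]

def tag_all(sentences, workers=4):
    """Tag every sentence in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tag_sentence, sentences))
```

`pool.map` keeps results in the same order as the input, so sentence *n* of the output still corresponds to line *n* of the text file.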
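The transition matrix itself is just counting consecutive POS pairs. A minimal sketch of the same idea (mine, not the Java original), using a dict of pairs rather than a fixed 36 × 36 array, and normalising by the total number of transitions as described above:

```python
from collections import Counter

def transition_matrix(tagged_sentences):
    """Count POS-to-POS transitions within each sentence, then turn
    the counts into probabilities by dividing by the grand total."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = [tag for _word, tag in sent]
        for prev, nxt in zip(tags, tags[1:]):  # consecutive pairs
            counts[(prev, nxt)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}
```

All the probabilities sum to one across the whole matrix; normalising each row instead would give per-POS conditional probabilities, which is the other common convention.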
The Next Steps (if/when I get round to them)
1. Do some sort of cross-validation on the matrix using my own (blog) dataset: is the matrix consistent across my own writing?
2. Do the same thing with other writers’ texts. Does my matrix differ markedly from, say, Ernest Hemingway’s? (It will!) And are his matrices consistent across different texts?
3. Can I use my matrix to check for typos? (Which rather depends on (1) and (2).)