Typo Detection Part VI: End of the comparisons, for now

This entry is part 6 of 6 in the series Typo detection

The Wonderful Wizard of Oz

So, I’ve spent a bit of time comparing my writing’s transition matrix with three classic authors – Hemingway, Baum and Darwin. This, really, was a little bit of an aside: Secretly, I was hoping some sort of magic formula would jump out to tell me how to write better. Unfortunately, if the magic formula is there, I cannot see it.

As for what can actually be concluded, I am not really sure I can conclude anything, except the little tit-bits I note below.

Frobenius Distance and Pearson Correlation Coefficient

The Frobenius norm/distance and the Pearson correlation coefficient, both of which reduce the comparison of the ‘3d’ transition matrices to single-number values, suggest that my writing is as similar to Hemingway as it is to Baum or Darwin; it has a $\rho_{pearson}$ of 0.8-something with all three authors. Now, I read somewhere that in SOME fields 0.8 is considered ‘pretty good correlation’. Of course, rumour has it that a banana’s DNA is 70% similar to ours, and 0.8 is closer to 0.7 than to 1.0! Take from that what you will. Mind you, if you look at the values comparing the three authors with each other, they range from 0.7 to 0.9, so my 0.8 occupies a sort of middle ground. So perhaps, based on this, we could say:

1. You cannot really tell whether one author is better/worse than another based on condensing the transition matrix down to a single parameter. I do not know whether this is because too much is lost in this reduction, or whether the information is just not in the transition matrices to start with.
2. If the writing is coherent and understandable, regardless of how well it’s composed and how well it’s written, it’s going to give a Pearson correlation coefficient of roughly 0.7 or more when compared to another author’s writing. As I suggested somewhere before, this is hardly surprising. Language construction follows conventions and rules that make it understandable: grammar. And the Stanford NLP is written to deconstruct sentences using these rules as the basis for its algorithms.
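To make the two single-number measures concrete, here is a minimal sketch of how they could be computed. It is not the code used for this series: the function names `flatten`, `frobenius_distance` and `pearson` are made up for illustration, and the matrices are assumed to be plain nested lists of numbers with identical shapes (nested to any depth, so a ‘3d’ matrix works too).

```python
import math

def flatten(m):
    """Flatten an arbitrarily nested list of numbers into one flat list."""
    if isinstance(m, (int, float)):
        return [float(m)]
    out = []
    for item in m:
        out.extend(flatten(item))
    return out

def frobenius_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    xs, ys = flatten(a), flatten(b)
    if len(xs) != len(ys):
        raise ValueError("matrices must have the same number of entries")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

def pearson(a, b):
    """Pearson correlation between the flattened entries of two matrices."""
    xs, ys = flatten(a), flatten(b)
    if len(xs) != len(ys):
        raise ValueError("matrices must have the same number of entries")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / (spread_x * spread_y)
```

Both functions throw away the position of each entry and keep only one number, which is exactly the reduction point 1 is worried about.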

Scatter Plots and Similarity Heat Maps

The scatter plots are a little overwhelming. I just thought they were an interesting way to look at the comparison, and they helped me understand the Pearson correlation coefficient.

The similarity heat maps: again, as per the scatter plots, too much information? Maybe if I were a linguist I could ‘use’ these maps to draw analytic conclusions.

The Next Thing-to-Do

The intended aim of this whole exercise was not really to compare my matrix to classic authors though (in the vain hope I would score 0.99 with all of them ;)). The real purpose was to provide some generic data for me to do some sort of cross-validation of my transition matrix. I might still do that (I started writing the code). Actually, when I was writing the points above I started thinking about what the results would be if I put my own writing against itself. Would two transition matrices derived from my own writing be more aligned than mine versus, say, Hemingway?
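One way the mine-versus-mine comparison could be set up is a split-half check: cut one body of text in two, build a transition matrix from each half, and correlate them. The sketch below is only an illustration under stated assumptions. It uses a simple first-order (tag to next tag) matrix rather than the ‘3d’ matrices from this series, it assumes the text has already been turned into a list of part-of-speech tags, and the names `transition_matrix`, `pearson` and `split_half_correlation` are hypothetical.

```python
import math
from collections import Counter

def transition_matrix(tags, vocab):
    """Row-normalised counts of tag -> next-tag transitions (first-order only)."""
    counts = Counter(zip(tags, tags[1:]))
    matrix = []
    for a in vocab:
        row_total = sum(counts[(a, b)] for b in vocab)
        matrix.append([counts[(a, b)] / row_total if row_total else 0.0
                       for b in vocab])
    return matrix

def pearson(m1, m2):
    """Pearson correlation between the flattened entries of two 2D matrices."""
    xs = [v for row in m1 for v in row]
    ys = [v for row in m2 for v in row]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / (spread_x * spread_y)

def split_half_correlation(tags):
    """Correlate the transition matrices of the first and second half of one text."""
    vocab = sorted(set(tags))  # shared tag order so both matrices line up
    mid = len(tags) // 2
    first = transition_matrix(tags[:mid], vocab)
    second = transition_matrix(tags[mid:], vocab)
    return pearson(first, second)
```

If the split-half score for one author's text comes out no higher than the 0.7 to 0.9 seen between different authors, that would suggest the single-number measures cannot tell authors apart at all.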

Actually, come to think of it, that was the whole point of the exercise, I just forgot with my head buried in Frobenius Norms.

But I think I’ll dive back into the original project for the next stages. Spotting dumb typos before it’s too late!