Typo Detection Part IV: Comparing Matrices with the Pearson Coefficient



I cut the last post short, where I was looking at ways to quantify the similarity between two matrices, because it was getting too long-winded and math-ey. This one might be a little too (we’ll see)…but hopefully, it’ll be short.

I covered using the matrix norm for comparing my probability matrices in the last post, but I came across another way of boiling the comparison down to a single scalar number: a version of the Pearson correlation coefficient. I also found that similarity matrices were another option. Since I do not know what I am doing, I thought I would put all my options in the wash and see what came out the other end.

Corr2 Correlation

I am calling it corr2 because it is a MATLAB function of that name, part of the Image Processing Toolbox. The equation for it (from the MathWorks website) is:
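For reference, this is the formula as I understand it from the MathWorks documentation for corr2, where A and B are two matrices of the same size and \bar{A}, \bar{B} are the means of all of their elements:

```latex
r = \frac{\sum_{m}\sum_{n} \left(A_{mn} - \bar{A}\right)\left(B_{mn} - \bar{B}\right)}
         {\sqrt{\left(\sum_{m}\sum_{n} \left(A_{mn} - \bar{A}\right)^{2}\right)
                \left(\sum_{m}\sum_{n} \left(B_{mn} - \bar{B}\right)^{2}\right)}}
```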

I wrote a version in Java here (turns out I needn’t have done, read on).
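A minimal sketch of what a Java version could look like, assuming two rectangular double[][] matrices of equal size. The class and method names (Corr2Sketch, corr2, mean) are hypothetical, and this is not the code linked above, just one plain way of writing the same calculation.

```java
/** Hypothetical helper class: a sketch of a corr2-style calculation. */
final class Corr2Sketch {

    /** Mean of every element in the matrix. */
    static double mean(double[][] m) {
        double sum = 0.0;
        int count = 0;
        for (double[] row : m) {
            for (double v : row) {
                sum += v;
                count++;
            }
        }
        return sum / count;
    }

    /**
     * 2-D correlation coefficient of two equally sized matrices.
     * Returns NaN if either matrix is constant (zero variance),
     * because the denominator is then zero.
     */
    static double corr2(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("row counts differ");
        }
        double meanA = mean(a);
        double meanB = mean(b);
        double sumAB = 0.0; // sum of products of deviations
        double sumAA = 0.0; // sum of squared deviations of a
        double sumBB = 0.0; // sum of squared deviations of b
        for (int i = 0; i < a.length; i++) {
            if (a[i].length != b[i].length) {
                throw new IllegalArgumentException("column counts differ in row " + i);
            }
            for (int j = 0; j < a[i].length; j++) {
                double da = a[i][j] - meanA;
                double db = b[i][j] - meanB;
                sumAB += da * db;
                sumAA += da * da;
                sumBB += db * db;
            }
        }
        return sumAB / Math.sqrt(sumAA * sumBB);
    }
}
```

Calling Corr2Sketch.corr2(a, b) on a matrix and a scaled copy of itself should give 1 (to within rounding), and on a matrix and its mirror image about the mean should give -1.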

Reading around on what exactly that formula is/does, I have concluded it’s ‘just’ the Pearson correlation coefficient – the ubiquitous ‘r’. Microsoft Excel has the Pearson correlation coefficient as the functions PEARSON() and CORREL(), but more familiar to me is the R² you get when you do those scatter chart lines-of-best-fit. We all added trendlines to our graphs at school and uni, right, and got the R² as the judge of how well things line up…which was invariably an indicator of whether we were going to get a good mark. Well, for a straight-line trendline, THAT R² is the Pearson correlation coefficient (squared).

This snapshot of a quick Excel chart I made was an epiphany for me.

From what I understand (I do not ‘get’ stats and probabilities generally), the Pearson correlation coefficient is a normalised measure of how closely two datasets follow a straight-line relationship: it is their covariance, normalised by their standard deviations. In my case the two datasets are the matrices I am thinking about. On this basis it seems to be a prime contender as Miss Right-Way-of-Doing-It for comparing my transition matrices.
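Written out, that description is the textbook definition of r for two lists of paired values x_i and y_i (here, the entries of the two matrices read off in the same order). The notation is just a convenient assumption:

```latex
r = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}
  = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\;
          \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
```

The 1/n factors cancel top and bottom, and the result always lies between -1 (perfectly opposed) and 1 (perfectly in step), with 0 meaning no straight-line relationship at all.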

‘Incidentally’ footnote: No 1

In this post, Microsoft kind of admit that, pre-Excel 2003, the PEARSON function and a few other statistical functions were (in some circumstances) buggy and ‘a little off’. It’s a strange notion, right?

I find ‘improved’ a strange word to be using. “Oh, 2+3 was coming out at 3.5, but we’ve since improved it…it comes out as 4 and a bit now”

‘Incidentally’ footnote: No 2

Karl Pearson on the left; Francis Galton, Darwin’s (half) cousin, on the right. The two of them, unfortunately, pioneered a slightly iffy social Darwinism (Galton coined the word ‘eugenics’).

Karl Pearson: It seems a little sad that he has been reduced to an uncredited ‘r’. He got busy in a lot of fields. Unfortunately, one of those was eugenics: if there were ever a way to get your name scrubbed out of history books, an interest in eugenics would be it. Someone did write his biography; maybe one day I’ll read it…

 

 
