Stylometry to determine authorship – Lennon or McCartney

Stylometry refers to the use of statistical techniques to determine who wrote an article, letter, or other text. We can also use stylometry to find out who painted a fine-art painting or composed a melody or song. Stylometry helped US law enforcement agencies identify Theodore Kaczynski as the Unabomber, an American domestic terrorist who was sentenced to eight life terms for a bombing campaign that killed three people.

Jason Brown, professor of mathematics at Dalhousie University, and Mark Glickman, senior lecturer in statistics at Harvard, share a passion for the Beatles.

It was this shared passion that led the two to wonder whether stylometry might settle the Lennon-or-McCartney question. In other words, could it tell us whether John Lennon or Paul McCartney wrote specific Beatles songs?

Stylometry helped the researchers determine whether Paul McCartney or John Lennon wrote some ‘unknown’ or ‘disputed’ Beatles songs.

Who wrote these Beatles songs?

According to Prof. Glickman, we know which of the two wrote most of the Lennon-McCartney songs. However, the authorship of many songs, or portions of songs, is disputed.

One example is ‘In My Life,’ from the 1965 album ‘Rubber Soul,’ which ranks high on Rolling Stone’s list of ‘The 500 Greatest Songs of All Time.’ Its authorship is disputed: McCartney has one memory of how it was written, while Lennon had a different one.

Prof. Glickman said:

“We wondered whether you could use data analysis techniques to try to figure out what was going on in the song to distinguish whether it was by one or the other.”

Stylometry for Beatles songs

With the help of Ryan Song, a former Harvard statistics student, they ‘decomposed’ each Beatles song into five representations. They focused on songs from 1962 to 1966.

In each representation, they measured how often each of a set of musical features occurred in a song.

Prof. Glickman explained:

“The basic idea behind our approach is to convert a song, whose musical content is difficult to quantify in any direct way, into a set of different data structures that are amenable for establishing a signature of a song using a quantitative approach.”

“Think of decomposing a color into its constituent components of red, green and blue with different weights attached.”

“We’re doing the same thing with Beatles songs, though with more than three components. In total, our method divides songs into a total of 149 constituent components.”

Stylometry in music

First representation

The first representation looked at how often certain chords appeared. It also included an aggregation of uncommon chords.

The team eventually formed eleven different chord categories.
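The chord-frequency idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers’ code: the function name `chord_frequencies`, the scale-degree chord labels, and the single ‘other’ bucket for uncommon chords are all hypothetical, and the chord sequence is made up.

```python
from collections import Counter

def chord_frequencies(chords, common_chords):
    """Return the relative frequency of each common chord, plus an 'other' bucket."""
    # Assumption: every chord outside the common set is pooled into 'other'.
    labels = [c if c in common_chords else "other" for c in chords]
    categories = sorted(common_chords) + ["other"]
    if not labels:
        return {cat: 0.0 for cat in categories}
    counts = Counter(labels)
    return {cat: counts[cat] / len(labels) for cat in categories}

# Made-up chord sequence written as scale degrees (not taken from a real song).
song = ["I", "IV", "V", "I", "vi", "IV", "V", "bVII"]
freqs = chord_frequencies(song, common_chords={"I", "IV", "V", "vi"})
print(freqs)
```

Pooling rare chords keeps the number of categories small, so each category is seen often enough to estimate its frequency.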

Second representation

The second representation characterized the melodic notes, that is, how often each note appeared in the melody.

Third representation

Third, they determined how often each chord transition occurred. That is, how often does one chord follow another chord? In this representation, they also aggregated uncommon chord transitions into combined categories.
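Counting transitions amounts to counting adjacent pairs. The sketch below is a hypothetical illustration: the function name `transition_frequencies`, the `min_count` threshold, and the rule of pooling rare transitions within a single song are assumptions, and the chord sequence is invented.

```python
from collections import Counter

def transition_frequencies(chords, min_count=2):
    """Return relative frequencies of chord-to-chord transitions in one song."""
    # Pair each chord with the chord that follows it.
    pairs = list(zip(chords, chords[1:]))
    if not pairs:
        return {}
    pooled = Counter()
    for pair, n in Counter(pairs).items():
        # Assumption: transitions seen fewer than min_count times go to 'other'.
        pooled[pair if n >= min_count else "other"] += n
    return {key: n / len(pairs) for key, n in pooled.items()}

# Made-up chord sequence written as scale degrees.
song = ["I", "IV", "V", "I", "IV", "V", "vi"]
trans = transition_frequencies(song)
print(trans)
```

The same function would work for pairs of melodic notes if it were given a list of notes instead of chords.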

Fourth representation

Fourth, they determined how often specific melodic note pairs occurred.

Fifth representation

Fifth, they decomposed each song into four-note melodic ‘contours.’ In this context, a ‘contour’ describes whether a four-note melodic sequence rises, falls, or stays the same from one note to the next.

By examining four-note contours, they added extra detail that could help distinguish styles of composition.
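One way to picture a contour is as the direction of each step between consecutive notes. The sketch below assumes that reading of the description. The function names, the use of MIDI note numbers for pitch, and the sliding four-note window are all hypothetical choices, and the melody is made up.

```python
from collections import Counter

def contour(notes):
    """Describe a run of pitches by the direction of each step between notes."""
    steps = []
    for prev, cur in zip(notes, notes[1:]):
        if cur > prev:
            steps.append("up")
        elif cur < prev:
            steps.append("down")
        else:
            steps.append("same")
    return tuple(steps)

def four_note_contours(pitches):
    """Count the contour of every window of four consecutive notes."""
    windows = (pitches[i:i + 4] for i in range(len(pitches) - 3))
    return Counter(contour(w) for w in windows)

# Made-up melody as MIDI note numbers (60 is middle C).
melody = [60, 60, 60, 62, 60]
shapes = four_note_contours(melody)
print(shapes)
```

Under this reading, four notes give three steps, so there are 3 × 3 × 3 = 27 possible contours.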

Lennon and McCartney styles

The five representations served as signatures of different compositional styles. Lennon and McCartney had distinct styles. John Lennon, for example, typically composed melodic lines that varied very little in pitch.

Regarding the songs ‘Help!’ and ‘Michelle,’ Prof. Glickman said:

“Consider the Lennon song. It basically goes, ‘When I was younger, so much younger than today,’ where the pitch doesn’t change very much.”

“It stays at the same note repeatedly, and only changes in short steps. Whereas with Paul McCartney, you take a song like ‘Michelle,’ and it goes, ‘Michelle, ma belle. Sont les mots qui vont très bien ensemble.’ In terms of pitch, it’s all over the place.”

Beatles stylometry – three steps

The researchers used a three-step approach to infer disputed authorship from musical traits.

Step 1

Their model assumes that how often each of the 149 features occurs within a song depends on the song’s author.

For example, the root chord of a song (tonic) occurs with one frequency in Lennon compositions. However, in McCartney compositions, the tonic may have a different frequency.

Step 2

The team used Bayes’ rule, a common probability tool, to reverse the direction of the conditional probabilities. The American Statistical Association explains:

“In other words, starting with the frequency of the 149 musical features knowing a song’s author, they determine a model for the probability Lennon or McCartney wrote a song given the frequency of the 149 musical features.”

“This model was then trained using 70 Lennon-McCartney songs or song portions in which the authorship was truly known.”
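The reversal described in Steps 1 and 2 can be illustrated with a small naive Bayes classifier. This is a deliberately simplified stand-in, not the researchers’ model: the function names, the add-one smoothing, the two feature names, the generic author labels, and all the counts are invented for illustration.

```python
import math

def train(songs_by_author, vocabulary, alpha=1.0):
    """Estimate smoothed log-probabilities of each feature for each author."""
    model = {}
    for author, songs in songs_by_author.items():
        totals = {f: alpha for f in vocabulary}  # add-alpha smoothing
        for song in songs:
            for feature, n in song.items():
                totals[feature] += n  # features must be in the vocabulary
        norm = sum(totals.values())
        model[author] = {f: math.log(t / norm) for f, t in totals.items()}
    return model

def posterior(model, song, prior):
    """Bayes' rule: P(author | features) is proportional to P(features | author) * P(author)."""
    log_scores = {
        author: math.log(prior[author]) + sum(n * log_p[f] for f, n in song.items())
        for author, log_p in model.items()
    }
    # Subtract the largest score before exponentiating to avoid underflow.
    top = max(log_scores.values())
    weights = {a: math.exp(s - top) for a, s in log_scores.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

# Invented feature counts for songs of known authorship (not real data).
training = {
    "author_a": [{"repeated_note": 8, "large_leap": 2}, {"repeated_note": 7, "large_leap": 3}],
    "author_b": [{"repeated_note": 3, "large_leap": 7}, {"repeated_note": 2, "large_leap": 8}],
}
model = train(training, vocabulary=["repeated_note", "large_leap"])
disputed = {"repeated_note": 6, "large_leap": 1}
probs = posterior(model, disputed, prior={"author_a": 0.5, "author_b": 0.5})
print(probs)
```

Here the disputed song is dominated by repeated notes, so the posterior probability leans heavily toward the author whose known songs show the same habit.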

Step 3

They applied this model to Lennon-McCartney songs and song portions of disputed authorship. This yielded a predicted probability of authorship for each of these compositions.

Prof. Glickman said:

“So, the probability that ‘In My Life’ was written by McCartney is .018, which basically means it’s pretty convincingly a Lennon song.”

McCartney remembers differently. However, ‘The Word,’ which Glickman was almost sure was a Lennon composition, turned out to be almost definitely a McCartney song.

Is this exercise simply a ‘musical whodunnit’? Or is there more to it?

Prof. Glickman said:

“Yes (there is more to it). This technology can be extended. We can look at pop history and chart the flow of stylistic influence.”


The researchers will present their stylometry study and findings at JSM 2018 (the Joint Statistical Meetings) in Vancouver, Canada.

Title: “Assessing Authorship of Beatles Songs from Musical Content: Bayesian Classification Modeling from Bags-Of-Words Representations,” Mark Glickman, Jason Brown, and Ryan Song. Abstract #329336. JSM 2018, Wednesday, August 1, 2018: 10:30 AM to 12:20 PM.

The source of this article is the American Statistical Association.

What is statistics?

The term ‘statistics’ has two meanings:

1. In singular form, it is a discipline or science. For example, geography, statistics, economics, and biology are disciplines or sciences. I might say: “Statistics is an interesting subject.”

Statistics is the science of gathering and analyzing numerical information in very large quantities.

2. In plural form, it refers to the numbers. For example, I might say: “Statistics are suggesting that the economy is about to turn.”