Python GUI 025 – N-grams, Part 1

Ok, what are n-grams, and why do we care?
A bigram (or digram) is a pair of letters that appear together. A quick glance at the previous sentence shows that “er” shows up twice, as does “gr.”

A trigram is a group of three adjacent letters, like “the” or “ear.” The labeling continues with 4-grams, 5-grams, and so on.

While the easiest way to get a collection of common English n-grams is to search the net, the questions are “where can I get n-gram data for other languages?” and “how many are enough?” The alternative is to build up your own list from a wordlist, but we’re still left with “how many do I need?”

There’s not a lot of point in analyzing individual letters – in English, you’d want to include “a” and “i”, but every other letter could appear as someone’s initials, so that’s just a distraction. My approach is:

def sort_second(val):
    # Sort helper: orders [gram, count] pairs by their count.
    return val[1]

# wordlist, ngram_cnt_max and ngram_no_max are set elsewhere. For 2-, 3-
# and 4-grams, ngram_cnt_max is 4; ngram_no_max is the number of top
# n-grams to keep from each set (the top 20, 100, 600, etc.).
ret = []

for i in range(ngram_cnt_max - 1):
    gram_len = i + 2
    d = {}

    # Tally every segment of length gram_len in every word.
    for word in wordlist:
        if len(word) >= gram_len:
            for ptr in range(len(word) - gram_len + 1):
                gram = word[ptr:ptr + gram_len]
                if gram in d:
                    d[gram] += 1
                else:
                    d[gram] = 1

    # Copy the dict into a list and sort it by count, highest first.
    temp_list = []
    short_list = []
    for gram in d:
        temp_list.append([gram, d[gram]])
    temp_list.sort(key=sort_second, reverse=True)

    # Keep the top ngram_no_max grams, or everything if there are fewer.
    smallest_max = ngram_no_max
    if len(temp_list) < smallest_max:
        smallest_max = len(temp_list)
    for j in range(smallest_max):
        short_list.append(temp_list[j][0])

    ret.append(' '.join(short_list))

What this does
Say we want data for 2-, 3- and 4-grams. “i” will range from 0 to 2, so gram_len goes from 2 to 4. For each word in our wordlist, step through the word in segments of length gram_len. If a segment doesn’t already exist in the dict d, add it and initialize its count to 1; otherwise, increment its count by one.

When we’ve finished this pass, copy the dict data into the list temp_list and sort temp_list by count, highest first. Finally, copy the top “x” n-grams to ret as a space-separated string, and move on to the next value of “i”. “X” here is set by smallest_max, the number of top n-grams we want to keep from each set, which could be the top 20, top 100 or top 600.

If smallest_max is 20, then our n-grams file (when we save it to a file) may look like:

IN RE TI ER TE AT EN AN AR ST
ON RA EA RI CO OR OU LI IO RO

TIO ATI ENT CON STA PRO RES ATE PRE STE
IGH TER COM REA ION OUN CTI EVE TUR TIN

ATIO TION CTIO COMP ENTI PRES STAN VENT INTE CONT
OLOG OUNT IGHT ATUR TATI ECTI SPEC IENC THOU THIN
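
The save step itself isn’t shown above; a minimal sketch of it (ignoring the line wrapping in the sample, and using an example filename) could be as simple as:

# 'ngrams_en.txt' is just an example name; ret comes from the code above.
with open('ngrams_en.txt', 'w') as f:
    f.write('\n'.join(ret))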

So, what do we do with this?
Let’s go with a simple use case. First, pick an example quote.

"You will never reach your destination if you stop and throw stones at every dog that barks. Winston Churchill"

Pick a 9-digit number and write the quote under it without spaces or punctuation.

693174258
---------
Youwillne
verreachy
ourdestin
ationifyo
ustopandt
hrowstone
sateveryd
ogthatbar
ksWinston
Churchill

Write the columns out in number order and group in 5s.

WRDOO WEHIR LCTFN ORBTI URRIT OTTWU
LASIA TETSH NHIYD NYAOL YVOAU HSOKC
IEENP SVANC EYNOT EDRNL OEUTS RAGSH
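
In code, that enciphering step might look something like the sketch below. This is just an illustration to make the process concrete; columnar_encipher and its details are made up for this post, not taken from my solver.

def columnar_encipher(key, plaintext):
    # Keep only the letters, uppercased, and "write" them under the key.
    text = ''.join(c for c in plaintext if c.isalpha()).upper()
    width = len(key)
    # Column j is every width-th letter, starting at position j.
    columns = {digit: text[j::width] for j, digit in enumerate(key)}
    # Read the columns out in ascending key-digit order, then group in 5s.
    cipher = ''.join(columns[d] for d in sorted(key))
    return ' '.join(cipher[i:i + 5] for i in range(0, len(cipher), 5))

quote = ("You will never reach your destination if you stop and throw "
         "stones at every dog that barks. Winston Churchill")
print(columnar_encipher('693174258', quote))
# WRDOO WEHIR LCTFN ORBTI URRIT OTTWU LASIA TETSH ...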

Software solvers are good at brute-force attacks on the key for transposition cipher types, just as with cryptarithms, so we could easily create a loop to decipher this message with every possible key arrangement. But that would leave us with thousands of results to wade through to find the one key that produces the result we want. Instead, we can count the number of n-grams in each trial decryption and then just look at the results with the biggest counts.

The question becomes “how many n-grams should we use, and is there an advantage to using just 2-, 3- or 4-grams, or some combination?”

If we save the n-grams to a file, space-separated and newline-delimited, then reading the file contents into fstr and using

grams_lines = fstr.replace('\n', ' ')
grams_lines = grams_lines.replace(' ' * 2, ' ')
grams = [g for g in grams_lines.split(' ') if g]   # drop any empty strings

will put all of the n-grams in one big list, and counting is then easy.

cnt = 0
for g in grams:
    cnt += cipher.count(g)   # cipher holds the text being tested

print(cnt)

Testing our cipher, with just 20 of each n-gram, we get a count of 14.
Testing our plaintext message, we get 86.

Using my original Python columnar solver (which isn’t Tkinter-enabled yet), what I want to know is: what’s the execution time to test every key permutation, and what are the top five count results?
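
For anyone who wants a concrete picture, the brute-force test amounts to something like the sketch below. It’s an illustration, not my solver’s actual code; columnar_decipher, gram_count and the variable names are invented for this post, and the decipher step assumes a completely filled grid, as in the example above.

import time
from itertools import permutations

def columnar_decipher(key, cipher):
    # Split the cipher back into equal columns (in key-digit order),
    # then read the grid off row by row.
    rows = len(cipher) // len(key)
    cols = {}
    pos = 0
    for d in sorted(key):
        cols[d] = cipher[pos:pos + rows]
        pos += rows
    return ''.join(cols[d][r] for r in range(rows) for d in key)

def gram_count(text, grams):
    # Score a trial decryption by counting n-gram hits.
    return sum(text.count(g) for g in grams)

cipher = ('WRDOO WEHIR LCTFN ORBTI URRIT OTTWU '
          'LASIA TETSH NHIYD NYAOL YVOAU HSOKC '
          'IEENP SVANC EYNOT EDRNL OEUTS RAGSH').replace(' ', '')

# grams is the flat n-gram list built from the file earlier.
start = time.time()
results = []
for perm in permutations('123456789'):   # 362,880 candidate keys
    key = ''.join(perm)
    plain = columnar_decipher(key, cipher)
    results.append((gram_count(plain, grams), key, plain))
results.sort(reverse=True)

print('Run time:', round(time.time() - start, 1), 'seconds')
for cnt, key, plain in results[:5]:
    print('Count:', cnt, 'Key:', key)
    print(plain)

Here’s what my solver reports with the top 20 of each n-gram width: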

Run time: 23 seconds

Count: 86 Key: 256931748
LNYOUWILECHVERREAYTIOURDESNFYATIONIONDUSTOP
ATONHROWSTERYSATEVEDBAOGTHATRTOKSWINSNILCHURCHL

Count: 84 Key: 693174258
YOUWILLNEVERREACHYOURDESTINATIONIFYOUSTOPAN
DTHROWSTONESATEVERYDOGTHATBARKSWINSTONCHURCHILL

Count: 83 Key: 125689374
WLNYEOUILRCHVYEREADTIONURESOFYAOTINIONDUTST
PAWONHEROSTERYSDATVEHBAORGTATITOKNSWNSRILCLHUCH

Count: 83 Key: 125869374
WLNEYOUILRCHYVEREADTINOURESOFYOATINIONDTUST
PAWONEHROSTERYDSATVEHBAROGTATITONKSWNSRILLCHUCH

Count: 83 Key: 693174852
YOUWILENLVERREAYHCOURDESNITATIONIOYFUSTOPAT
DNHROWSTENOSATEVEDYROGTHATRABKSWINSNOTCHURCHLLI

Note that while the key 693174258 gave the correct answer, with a gram count of 84, it wasn’t actually the highest scoring solution. This should act as a warning to not simply take n-gram counts at face value.

Now, how about 50 n-grams each?

Run time: 36.7 seconds

Count: 132 Key: 693174258
YOUWILLNEVERREACHYOURDESTINATIONIFYOUSTOPAN
DTHROWSTONESATEVERYDOGTHATBARKSWINSTONCHURCHILL

Count: 125 Key: 869317425
EYOUWILLNYVERREACHNOURDESTIOATIONIFYTUSTOPA
NDEHROWSTONDSATEVERYROGTHATBANKSWINSTOLCHURCHIL

Count: 120 Key: 174258693
WILLNEYOUREACHYVERDESTINOURONIFYOATIOPANDTU
STWSTONEHROEVERYDSATHATBAROGTINSTONKSWRCHILLCHU

Count: 118 Key: 931742568
OUWILLNYEERREACHVYURDESTIONTIONIFYAOSTOPAND
UTROWSTONHEATEVERYSDGTHATBAORSWINSTOKNHURCHILCL

Count: 116 Key: 931742586
OUWILLNEYERREACHYVURDESTINOTIONIFYOASTOPAND
TUROWSTONEHATEVERYDSGTHATBAROSWINSTONKHURCHILLC

We’re starting to see two things happen. We’re getting a much clearer correct winner, but we’ve increased the run time by nearly 60%. Just for amusement, let’s look at the relative difference between the n-gram counts for the correct answer and the nearest incorrect one.

(84 - 86) / 84 = -0.02
(132 - 125) / 132 = 0.05

Well, for a 60% hit on run time, we got about a 0.07 improvement in “accuracy.”

The next step is to see which has the bigger impact on n-gram count – the 2-, 3- or 4-grams?

Top 50 n-grams of each width:
width .. sec .. high cnt .. top is correct?
. 2 ..... 26 ..... 78 ........ yes
. 3 ..... 25 ..... 42 ........ yes
. 4 ..... 27 ..... 29 ........ no

This is to be expected: there should be more short, common segments than long ones, and there are. However, combining the n-gram lengths (testing for 2-, 3- and 4-grams all at the same time) gives us greater assurance that the highest score will go to the solution that’s actually better English.

All that really remains is to verify the hit on run time for 100 and 200 n-grams of each width.

n-grams per set .. sec .. high cnt .. ratio .. correct?
100 ..... 51.5 ... 207 ...... 0.09 .... yes
200 ..... 83.6 ... 267 ...... 0.14 .... yes

Conclusions
First, total test run time is dependent on the length of the cipher message. The longer the message, the longer it will take to do the counts.

Second, the content of the message will also impact the accuracy of the results. If the message was constructed to exclude the letter “e”, that’s going to throw out a good 20% (more or less) of the n-grams right there.

Third, the quality of the wordlist is going to affect the quality of the n-gram lists. Right now, I only have about 2,800 English words built up. It’s even worse for Esperanto, which has fewer than 20 words. The longer I solve CONs, and the more words that get added, the more accurate the n-gram collections will become.

Fourth, run time. We’re only talking about differences of seconds right now for a simple transposition, and the Transpo solver is expected to have no more than 10 different kinds of transposition implemented at this point. On top of that, half of those types rarely appear in the Cm newsletter. I can afford to wait a total of maybe 5 minutes for Transpo to run on the types that do appear.

Fifth, accuracy versus check time. One thing I can do in the solver is keep a running top score. That is, every time a solution produces a higher count than before, I update the top-count variable and add the new solution to a list. For safety’s sake, I can also keep any solutions within 5 points of the current top count. At the end, I reverse-sort on the counts and display just the top 5 or so solutions in a listbox widget, which I can then click on to mark the right solution.
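
As a rough sketch, assuming the helper names and data structures here (which are placeholders, not the solver’s actual variables), that bookkeeping might look like:

top_count = 0
candidates = []   # (count, key, solution) tuples worth keeping

# trial_solutions would come from the brute-force loop; grams is the
# flat n-gram list from earlier.
for key, solution in trial_solutions:
    cnt = sum(solution.count(g) for g in grams)
    if cnt > top_count:
        top_count = cnt
    # Keep anything within 5 points of the best count seen so far.
    if cnt >= top_count - 5:
        candidates.append((cnt, key, solution))

# Reverse sort on count and show only the top 5 in the listbox.
candidates.sort(reverse=True)
top_five = candidates[:5]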

That means I can use a smaller n-gram set (the top 20 or top 50 of each width) to benefit from the faster run time, and still solve most of the transposition ciphers in under a minute each.

One thing to note here is that a transposition’s correct key is guaranteed to be somewhere within the keyspace, so if I test the entire keyspace (say, all numbers between 1 and 10,000), I will hit the right key at some point, regardless of how long the solver takes to run. This is opposed to substitutions, which can have much larger keyspaces that may theoretically take forever to map out. For those, my solvers may not actually find the correct solution, but they may recover enough of a partial solution to let me finish the job manually.

Anyway, I’ve found that most transposition types are solvable by brute force within a minute or two for periods up to 10 or 11. After that, they can start taking as much as an hour for the next width up (11 or 12). At that point, it’s better to switch to hillclimbing, which will also need to be coupled to n-gram counting.

Lastly, I can collect the n-grams for other languages, but it’s definitely looking like I’ll need to implement the suppress lists for German, French, Esperanto, etc., to convert characters like “Í” to “I” before counting the segments in each word.
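
As a minimal sketch of the kind of conversion I mean (my suppress lists may end up working differently, and letters like the German “ß” will still need their own handling), Python’s unicodedata module can strip the accents:

import unicodedata

def strip_accents(word):
    # Decompose accented characters ('Í' becomes 'I' plus a combining
    # accent), then drop the combining marks before counting segments.
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('Í'))   # I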

[Edit:
I discovered a thing I should have known about a long time ago.

I was trying to solve a particularly tricky Patristocrat (spoiler, I failed), and I’d thought that increasing the number of n-grams would give me better results in finding the correct solution. So, I kicked the maximum number up to 600 for each of the 2-, 3- and 4-gram sets. I was surprised to discover that the 2-grams set only had 421 letter pairs, and it took me a while to realize what had happened.

If we look at a 2-gram as having the components a+b, and say that a and b can each take on the value of any of the 26 uppercase English letters, then there are only 26 × 26 = 676 possible 2-letter combinations in total. In other words, by specifying a maximum of 600 2-grams, I’d exceeded the number of “legal” 2-grams I could reasonably expect from a wordlist containing 2,800 words.

This also puts a hard cap of 676 on the 2-gram list, and getting anywhere near it negates the usefulness of 2-grams entirely. That is, testing for every single 2-letter combination is a waste of computing time, since pairs like “jx” and “qq” don’t occur normally, and the frequency of “ox” is so far down the chart as to be ignorable.

Again, 50 is a good cutoff for 2-grams, and you don’t normally gain much for anything above a ceiling of 200.]

Next up: cons_parser_update_wordlists.py.
