Python GUI 024 – Wordlists, Part 2

I finally got back to working on the Python Tkinter project and got the n-gram generator section implemented. Along the way I encountered a few issues that I need to discuss here first, before showing the code. Most of what follows covers the philosophical decisions that are going to affect later design approaches.

Digression: Lately, I’ve been focusing on writing up the summary of volume 2 of the History of the U.S. Signal Corps. One of the biggest problems the Corps had at the beginning of WW II was that the other services kept demanding a variety of radio and radar systems, while the General Command refused to provide funding for the research, design, development, or manufacture of the majority of them because they didn’t understand the different use cases.

Armored infantry (tanks) needed two different kinds of radios (one for use in the very noisy environment inside the tank, and the other for communicating with other tanks and with HQ). The Air Force needed all kinds of stuff: communications within the airplane, between planes of different makes and models, and with ground control, plus receipt of weather information on a dedicated channel, homing beacons, etc. Regular infantry needed mobile telephone and telegraph, hand-held walkie-talkies, and stationary radio for communications with the command center. Paratroopers needed ultra-light handy-talkies for short-range communications with each other. Everything kept coming back to use cases – what are you trying to do, and what is needed to accomplish it?

Returning to the Python-TKinter project, what’s the use case?

When I first started writing my software solvers, I only cared about simple substitution CONs (Aristocrats and Patristocrats) in English. I think this is a common mind-state for most beginners. My first reference text for Python was Al Sweigart’s Cracking Codes With Python. His chapter on simple substitution solving was not only limited to English ciphers, it also worked with only a 25,000-word dictionary and hard-coded n-gram lists.

Things were fine while I branched out to other cipher types, like Fractionated Morse, Vigenère and AMSCO, but I slowly realized that I didn’t know how to allocate specific functions to different utility modules, and was instead lumping functions together on an ad hoc basis.

It was when I began tackling xenocrypts (CONs in other languages) that I really got in over my head. First, I needed to get wordlists from somewhere, and initially that “somewhere” was the ACA’s download archives. Those lists were out of date, and were probably created in the early stages of ASCII code table development, when the availability of non-English character sets was more limited. The files contain all kinds of weird character combinations that no one understands anymore. One example in the German file is the use of A" to represent Ä. I don’t know what most of the other character combinations mean.

I mentioned in Wordlists, Part 1, that certain languages have alphabets that don’t match English well and are handled in a specific way, including Swedish and Norwegian. Plus, there are other rules for the French letters like ÉÈÊ, and Spanish Í. The question becomes, what are those rules, and how do they impact the generation of non-English wordlists and n-gram lists?

Let's start with English.
con = "I can't say there are many things I regret, but this is one of them! *General G.A. *Custer, 1876"

To get the individual words, I want to remove all of the punctuation, numerics, and non-alphabetic characters (like newline). Simply deleting “.” will turn “G.A.” into “GA”. Instead, it’s better to replace it with a space (" "). But, turning the apostrophe (') into a space turns “can’t” into “can t”. So, I need two lists, one for simply deleting characters, and the other for turning them into spaces.

What complicates matters is that shift-7 in Notepad gives me the straight apostrophe ', while in Word I get the curly ’. Same issue for shift-2: the straight " compared to the curly “ and ”. The lists of characters to address may depend on the text or word editor used to author the CON. I’ll start with the following:

to_delete = "'’"
to_space = '\n#,.!?¿*+"();:–-—“”/1234567890$‘=…_'

Is this a complete list? Probably not. So I need a check for any characters still in the plaintext, using a legality list, which I want to be all uppercase but which has to include the space.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '

I’ll print out anything not in the legal list, and decide what to do with it as I go along.
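One way to sketch that check (the helper name is mine, not part of the project):

```python
# Report any characters that survive the cleanup but aren't in the
# legality list, so the to_delete/to_space lists can be extended.
legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '

def find_illegal(text, legal=legal):
    # Keep first-appearance order while skipping duplicates.
    seen = []
    for ch in text:
        if ch not in legal and ch not in seen:
            seen.append(ch)
    return seen

print(find_illegal('CAN@T SAY!'))  # ['@', '!']
```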

After all this, I’m left with multiple spaces in my string, and I need to make sure the words are separated by single spaces. I can do that as follows:

while ' '*2 in con:
    con = con.replace(' '*2, ' ')

I’m using ‘ ‘*2 instead of two spaces in the quotes because WordPress deletes multiple spaces.
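For what it’s worth, the same collapse can be done in one pass with split/join (shown on a throwaway string here):

```python
# str.split() with no arguments splits on any run of whitespace and
# drops the empty pieces, so joining with single spaces collapses all
# repeated spaces at once.
text = 'I   CANT  SAY'
text = ' '.join(text.split())
print(text)  # I CANT SAY
```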

Now, I can make a list of the words in my CON by using:

con_list = con.split(' ')

Printing out con_list, I get:

I
CANT
SAY
THERE
ARE
MANY
THINGS
I
REGRET
BUT
THIS
IS
ONE
OF
THEM
GENERAL
G
A
CUSTER

Notice that “I” appears twice, and “G” is not a valid single-letter word. One way of eliminating duplicates would be:

wordlist = []
for word in con_list:
    if word not in wordlist:
        wordlist.append(word)

After this, I have to decide what words have the highest likelihood of appearing in future CONs, and add those to the full English wordlist, for speed purposes. This is a manual process, where I use a listbox widget to hand select only those words I want to keep. That’s where I’d weed out “G” and “CUSTER.”
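A minimal sketch of how that hand-selection step might look (the widget layout and function names are my own, not the project’s actual code):

```python
def selected_words(words, indices):
    # Pure helper: the words at the given listbox indices.
    return [words[i] for i in indices]

def review(words):
    # Candidates go into a multi-select Listbox; whatever the user
    # highlights before clicking the button is kept for the wordlist.
    import tkinter as tk  # imported here so the helper above stays GUI-free
    root = tk.Tk()
    box = tk.Listbox(root, selectmode=tk.MULTIPLE, exportselection=False)
    for w in words:
        box.insert(tk.END, w)
    box.pack()
    kept = []

    def done():
        kept.extend(selected_words(words, box.curselection()))
        root.destroy()

    tk.Button(root, text='Keep selected', command=done).pack()
    root.mainloop()
    return kept

# Demo of the pure helper (the GUI itself isn't launched here):
print(selected_words(['I', 'CANT', 'SAY', 'G', 'CUSTER'], (0, 1, 2)))
```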

One use case for the English alphabet is in the encipherment-decipherment process for simple substitution. We have two alphabets, one for the plaintext message, and the other for the ciphertext message.

abcdefghijklmnopqrstuvwxyz - Plain
XYZCIPHERABDFGJKLMNOQSTUVW - Cipher

While this is not important for the English wordlist, it is a factor when it comes to the other languages. What’s useful to note here is that the cipher alphabet is restricted to the 26 English letters.

Moving on to the other languages:

Italian:
The Italian alphabet is probably one of the most straightforward to deal with in a xeno. It uses 21 of the 26 English letters, without introducing other letters. In order to maintain a one-to-one relationship between the plain and cipher characters, we’ll use the full 26-letter English alphabet for Italian CONs.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '
abcdefghijklmnopqrstuvwxyz - Italian plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Latin:
Latin has 23 letters, but is fundamentally the same as Italian.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '
abcdefghijklmnopqrstuvwxyz - Latin plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

German:
The standard 26 English letters, plus Ä Ö Ü ẞ.
The ACA’s approach to standardization is to replace ẞ with “ss”, Ä with “A”, Ö with “O” and Ü with “U”. Ä Ö Ü ẞ may still appear in the plaintext solutions on the ACA website, but generally not in the ciphertext. Therefore, any search on German words in a list needs to be able to find Ä Ö Ü. Additionally, if you want to use Google Translate to get the English equivalent of the German plaintext, you’ll need to have Ä Ö Ü in your solution as well.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ '
abcdefghijklmnopqrstuvwxyz - German plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher
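A sketch of that flattening, with the mapping assumed from the rules above (the wordlist itself keeps Ä Ö Ü so that lookups and Google Translate input still work):

```python
# ACA-style flattening for German ciphertext: umlauts lose their dots
# and the sharp s becomes "SS"/"ss".
GERMAN_FLAT = {'Ä': 'A', 'Ö': 'O', 'Ü': 'U', 'ẞ': 'SS', 'ß': 'ss'}

def flatten_german(word):
    return ''.join(GERMAN_FLAT.get(ch, ch) for ch in word)

print(flatten_german('GRÖẞE'))  # GROSSE
```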

French:
This is where the philosophical questions start arising. French has the standard 26 English letters, plus some diacritics and ligatures. So far, most of the special stuff has not shown up in the CONs from 2023. I haven’t encountered Æ or Œ at all, and I assume they’ve been split into plain “AE” and “OE.” What I have found are É È Ê Ç, which apparently just get substituted with E and C.

What’s an exception to the other languages so far is the use of ' at the start of words, such as “l’école” or “s’il vous plaît” (based on some cursory reading, “l'” is the short form of “le” or “la” in front of nouns that start with a vowel). Should the ' be used as a separator, with “l” being discarded and “école” going into the wordlist on its own? Should “l’école” go into the list as-is? Or, as “lécole,” because that’s how it would show up in a Patristocrat? (The ACA French list contains words in the form “l’école,” but “école” itself isn’t in there.)

At the moment, I’m keeping the ‘ as a legal character.

legal = "ABCDEFGHIJKLMNOPQRSTUVWXYZÉÈÊÇ '"
abcdefghijklmnopqrstuvwxyz - French plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Spanish:
Spanish is also problematic. The ACA wordlist predates the 2010 publication of the latest version of the Ortografía de la lengua española. Of the CON plaintexts in Spanish, the only non-English letter has been Í. Everything else appears to be substituted (such as N for Ñ).

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÍ '
abcdefghijklmnopqrstuvwxyz - Spanish plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Esperanto:
Esperanto shares a lot of aspects with Spanish. As such, they have similar rules for letter substitutions for the ciphertext alphabet.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '
abcdefghijklmnopqrstuvwxyz - Esperanto plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Portuguese:
From a simple character viewpoint, Portuguese has the standard 26 English letters, plus some accent marks and the tilde (~). The ACA standardizes the alphabet to the English letters.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '
abcdefghijklmnopqrstuvwxyz - Portuguese plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Afrikaans:
AÁÄBCDEÉÈÊËFGHIÍÎÏJKLMNOÓÔÖPQRSTUÚÛÜVWXYÝZ
The ACA’s alphabet simplifies this to the 26 standard English letters. So far, there have been only one or two Afrikaans CONs in the time period I’m collecting data for (Jan. to Aug. 2023), so my sample size is limited to about 20 words.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '
abcdefghijklmnopqrstuvwxyz - Afrikaans plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Dutch:
Here we get lucky. Following the Spelling Act of 15 September 2005, Dutch has the standard 26 letters, while the digraph “ij” sometimes acts as a single letter. In this respect, Dutch and English wordlists follow the same rules.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '
abcdefghijklmnopqrstuvwxyz - Dutch plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Norwegian:
Now we start getting into more specific use cases. Norwegian has the standard 26 letters, plus Æ Ø Å. I haven’t encountered Æ in a CON plaintext yet, but Ø and Å are relatively common. The ACA rule is to replace three of the less common English letters with the three specialty characters, and to show which three in a note like so:

[26-Alph. *wxz]

This states that the CON uses a 26-letter alphabet, as normal, but that W, X and Z are replaced by Æ, Ø and Å. These letters will be at the end of the regular letters, in a specific order. There’s no actual guarantee for which three letters will be replaced, so it’s important to verify them before starting to solve the CON. However, regarding the wordlist, all 29 letters are legal.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ '
abcdefghijklmnopqrstuvyæøå - Possible Norwegian plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher
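A sketch of building the plain alphabet from such a note (the function name is mine; the replaced letters are a parameter since they must be verified per CON):

```python
# Build the plaintext alphabet implied by a note like [26-Alph. *wxz]:
# the flagged letters are dropped and the specialty letters appended,
# per the ACA convention described above.
def note_alphabet(replaced='wxz', extras='æøå'):
    base = 'abcdefghijklmnopqrstuvwxyz'
    kept = ''.join(ch for ch in base if ch not in replaced)
    return kept + extras

print(note_alphabet())              # abcdefghijklmnopqrstuvyæøå
print(note_alphabet('qwz', 'åäö'))  # Swedish variant
```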

Swedish:
Like Norwegian, Swedish has 29 letters – the standard 26, plus Å, Ä and Ö. The difference is that generally the note will contain:

[26-alph. *qwz]

Again, *qwz is not guaranteed and it’s important to check before starting to solve the CON.

legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ '
abcdefghijklmnoprstuvxyåäö - Possible Swedish plain
ABCDEFGHIJKLMNOPQRSTUVWXYZ - Generic cipher

Full use cases:

I’m using the wordlists as follows:
1) Pattern word matching
Given a cipher word like GLBBLT (ABCCBD), list all words with the same pattern.

2) Finding words that match specific letter positions
If the solver has uncovered enough of the plaintext letter assignments to reveal “L_TT__”, list all words that match the known letters in those positions.

3) Generating n-grams
Run through the wordlist and generate lists of the top 2-letter, 3-letter and 4-letter groupings (e.g., er, re, ed, on, the, ion, ist).
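Rough sketches of the three use cases (helper names are my own):

```python
from collections import Counter

def pattern(word):
    # GLBBLT -> ABCCBD: the first new letter gets 'A', the next 'B', etc.
    mapping = {}
    out = []
    for ch in word:
        if ch not in mapping:
            mapping[ch] = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'[len(mapping)]
        out.append(mapping[ch])
    return ''.join(out)

def matches(template, word):
    # Template like 'L_TT__': '_' matches anything, letters must match.
    return len(word) == len(template) and all(
        t == '_' or t == w for t, w in zip(template, word))

def top_ngrams(words, n, count=10):
    # Count every n-letter grouping inside each word.
    grams = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return [g for g, _ in grams.most_common(count)]

print(pattern('GLBBLT'))            # ABCCBD
print(matches('L_TT__', 'LETTER'))  # True
```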

Caveats:

Use case 1 may require that special letters, like É, be replaced by their standard equivalent (E), prior to performing pattern matching.

Use case 2 may require that both the special letter and its replacement be displayed. The replacement may be needed for solving the CON, while the actual letter would be used for submitting the solution for credit, and for aiding Google translations.

Use case 3 is similar to case 2. The n-grams are used to automatically check whether a software solution is correct or not. Solutions containing larger numbers of 2-, 3- and 4-grams are more likely to resemble sentences in that given language. But, if the cipher text uses “special letter suppression,” the n-grams will need to as well.

To-do list:
1) Devise a substitution dict object for use in generating n-grams.
Something like:

sub_dict = {'É': 'E', 'È': 'E', 'Ê': 'E', 'Ç': 'C'}
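For what it’s worth, str.maketrans/str.translate can apply such a dict in one pass before the n-grams are counted:

```python
# str.maketrans accepts a dict of single-character keys, and
# str.translate applies the whole mapping in one pass.
sub_dict = {'É': 'E', 'È': 'E', 'Ê': 'E', 'Ç': 'C'}
table = str.maketrans(sub_dict)

print('ÉCOLE FRANÇAISE'.translate(table))  # ECOLE FRANCAISE
```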

Next up: n-grams Part 1.

Published by The Chief
