Python GUI 021 – Digression 2 – Word Lists

Real life is really starting to get in the way of what I want to do. There’s nothing really good or really bad going on, but the constant interruptions are preventing me from getting stuff done. At the top of the list, my ASUS tablet died. At least part of the problem might be a broken pin on the USB jack in the tablet itself, but it’s also draining the battery within an hour or two. The tablet’s at least 5 years old, and it’ll probably cost close to the price of the machine to get it fixed. I looked at a few of the new tablets at the nearby Bic Camera superstore, and ASUS wasn’t in the line-up this time. The prices are also unbalanced – a few brands were around $300, and the rest were over $750. I haven’t been able to look at any of them more closely, but with luck maybe I’ll get a few free minutes for that on Sunday. I’ll also try running (literally; they’re only a mile from me) over to Yamada Electronics and see what they have.

I considered getting a notepad PC recently, but the used shop nearby has apparently sold the ones they had on the shelves. I’d like to have something I can type up book reviews on in a coffee shop, but I also need a Mac or an Android-compatible device for running the Unlock! app to play escape room games. If push comes to shove, the tablet comes first.

Regardless, without a tablet I can’t read cryptography books during my breaks or on the streetcar. I have two reviews left in the backlog, and volumes 2 and 3 of the Signal Corps histories are huge, at 600-800 pages each. Even when I get a new tablet up and working, it’s going to take me at least a week or two to finish reading and summarizing vol. 2.

One of the “interruptions that isn’t” was solving the ciphers in the July-Aug. 2023 issue of the ACA’s Cryptogram (Cm) newsletter before the Oct. 31 deadline. Again, it was a toss-up between working on the Transpo Solver, and trying to get credit for as many of the Cm SOLs (solutions) as I could in the shortest amount of time possible. I suddenly found myself with a few uninterrupted hours on a couple days in a row, and I sat down and tackled the Aristos and Pats (simple substitutions both with and without word breaks). By the end of the first night, I’d also worked through about half of the other ciphers (Playfair, Fractionated Morse, etc.) in the Cipher Exchange (CE) department. The second night, I finished off the remaining CE CONs that I could do with my existing Python solvers, and the first nine Xenos (xenocrypts; ciphers in other languages), which were also mostly Aristos and Pats.

The only Xeno I couldn’t solve on day two was an Aristo in Esperanto, because I didn’t have an Esperanto word list. Day three was spent addressing that issue, and I surprised myself by solving that one at the end of the night, too. Then I needed about an hour to copy-paste my solutions and recovered keywords into CONs Parser. After that, I exported the data to an email to send to the ACA for credit on 77 CONs, and I was happy.

The thing is, no matter what I work on, there seems to be a way to tie it to the Python-TKinter tutorial series somehow.

So, what is a word list, and why would we want more than one?
In the simplest sense, a word list is a list of all of the words in a given language, most likely in alphabetical order, one word per line. In this case, if you plan on working on the ACA’s Xenos, then you’d want a list for each language you intend to tackle. At this time, I have partial lists for Afrikaans, Danish, Dutch, Esperanto, French, German, Italian, Latin, Portuguese, Spanish and Swedish, which cover 99% of the languages I’ve seen in the Cm to date. I did solve one CON in Interlingua, but I don’t know where that list went.

In the unsimplest sense, word lists can be generated to serve particular purposes that you may have in mind. Two examples are: by letter count, and by pattern.

A few people I’ve corresponded with will put all of the 3-letter words into one file, all the 4-letter words in another, and so on. I don’t consider separate files for letter counts to be all that useful, since it’s just as easy to read the entire list all at once and then sort the words into a list of lists, where each sublist holds the words of a given length. Such as:

# One empty sublist per possible word length.
wordlist = [[] for _ in range(lengthoflongestwordinmasterlist)]

for word in masterlist:
    wordlist[len(word) - 1].append(word)

This snippet puts all of the one-letter words into wordlist[0], two-letter words into wordlist[1], etc. So, if I want to step through the five-letter words for some reason, I could use:

for word in wordlist[4]:
    # do work on word

When would I use a word list?
Mainly, word lists are great for brute-force attacks against keywords, and for finding matching words in substitution-type ciphers, such as Aristos, keyphrase, Ragbaby, and Checkerboard.

Ok, what’s a pattern word list?
Pattern words are those that have repeated letters, such as “repeated,” “letters,” “pattern,” and “people.” A pattern word list would then collect all of the words together by their pattern, making it easier for the user to walk through a collection of suspects to pick out specific words that might match the subject of the message.

Making patterns
A lot of people seem to use an “abc” pattern system, but I prefer a numeric list because it gives me the option to have really long patterns (over 26 characters long), and because I think comparing numeric lists is faster than string operations. But, I’ll demonstrate with alphabetics. The process is to start at the left of the word and assign the first letter to “a”. Move to the next letter to the right; if that letter has already appeared, give it the same pattern letter it got before, otherwise assign it the next unused pattern letter (“b”, then “c”, and so on). Repeat to the end of the word.

Examples:

letter --- abccbd
people --- abcadb
repeated - abcbdebf
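
In Python, that conversion only takes a few lines. Here’s a minimal sketch (the function name word_pattern is just my own placeholder):

def word_pattern(word):
    # Give the first distinct letter 'a', the second 'b', and so on,
    # reusing the same pattern letter whenever a letter repeats.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    mapping = {}
    pattern = ''
    for ch in word.upper():
        if ch not in mapping:
            mapping[ch] = letters[len(mapping)]
        pattern += mapping[ch]
    return pattern

print(word_pattern('letter'))    # abccbd
print(word_pattern('people'))    # abcadb
print(word_pattern('repeated'))  # abcbdebf

This is the alphabetic version; for patterns with more than 26 distinct letters you’d switch to the numeric approach mentioned above.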

In this way, if we looked in the list file, we might have something like:

abccbd:
BALLAD, BEDDER, BEGGED, BELLED,
BELLES, BETTER, BOFFOS, BORROW,
BOTTOM, CABBAL, CALLAS, CELLED,
COMMON, COTTON, DOLLOP, FELLED,
LEGGED, LESSEN, LESSER, LETTER

abccdef:
BADDEST, BAFFLED, BAFFLER, BAFFLES,
BAGGIER, BAGGIES, BAGGILY, BALLETS,
BOPPING, BOSSIER, BOSSILY, BOSSING,
BOTTLED, BOTTLER, BOTTLES, PATTERN
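
Generating a grouped file like that is straightforward once you have something like the word_pattern() helper sketched above. This is just a rough sketch, and the output file name is whatever you want it to be:

def write_pattern_list(masterlist, outname):
    # Collect the words under their pattern strings.
    groups = {}
    for word in masterlist:
        groups.setdefault(word_pattern(word), []).append(word)

    # Write each pattern header, then its words four to a line.
    with open(outname, 'w', encoding='utf-8') as fo:
        for pattern in sorted(groups):
            fo.write(pattern + ':\n')
            words = sorted(groups[pattern])
            for i in range(0, len(words), 4):
                fo.write(', '.join(words[i:i + 4]) + '\n')
            fo.write('\n')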

This is fine if you’re only looking up individual words, one at a time. But if you have a CON like the following:

WJI WTAJW TK GQZJ MNQQX HTRRZSNYD MFI TKYJS HTRRZSJI HQTXJQD BNYM YMJ FSNRFQX.

you’re going to have a lot of work ahead of you. From my list, I get over 100 hits on HTRRZSJI, another 100 on HTRRZSNYD, and MNQQX is no better. However, most CONs, even the harder contrived ones, will have multiple words that share letters. This is where I think my approach shines, although granted, making the pattern groups first into a dict object might speed things up a little bit.

Search on: HTRRZSNYD HTRRZSJI HQTXJQD

HTRRZSNYD - abccdefgh
HTRRZSJI -- abccdefg
HQTXJQD --- abcdebf

HTRRZSNYDHTRRZSJI - abccdefghabccdeij
HTRRZSNYDHTRRZSJIHQTXJQD - abccdefghabccdeijakblikh

1) Look for words that match the pattern “abccdefgh”.
2) Of the words that match the pattern “abccdefg”, check whether both words together match the larger pattern “abccdefghabccdeij”.
3) Of the words that match the pattern “abcdebf”, check whether all three words combined match the master pattern “abccdefghabccdeijakblikh”. Print all three words out if there’s a full match. Otherwise, keep looping through steps 3, 2, and 1 over the remaining candidates, as sketched below.
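
A minimal sketch of that nested search, again assuming the word_pattern() helper from above and an English word list already loaded into masterlist:

def combined_pattern_search(ct_words, masterlist):
    # Group the word list by pattern once, so each candidate lookup is cheap.
    by_pattern = {}
    for w in masterlist:
        by_pattern.setdefault(word_pattern(w), []).append(w)

    target = word_pattern(''.join(ct_words))
    hits = []
    for w1 in by_pattern.get(word_pattern(ct_words[0]), []):
        for w2 in by_pattern.get(word_pattern(ct_words[1]), []):
            # Prune early: the first two words have to match the start of
            # the master pattern before the third word is worth checking.
            if not target.startswith(word_pattern(w1 + w2)):
                continue
            for w3 in by_pattern.get(word_pattern(ct_words[2]), []):
                if word_pattern(w1 + w2 + w3) == target:
                    hits.append((w1, w2, w3))
    return hits

# e.g. combined_pattern_search(['HTRRZSNYD', 'HTRRZSJI', 'HQTXJQD'], masterlist)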

With my English word list, the only match is on: “COMMUNITY COMMUNED CLOSELY”.

Future versions of this kind of word-finder function will be added to the GUI-enabled projects. And yes, my lists include non-pattern words, like “like,” “word,” and “both.”

How to get word lists?
The best way is to find someone that has a complete list of the words in your desired language, and ask them for a copy.

The next best alternative is to search the net for anything that has the words collected in one place, such as a dictionary site, Wikipedia, or some group that teaches your target language. In my case, I started with Learn Esperanto. That only gave me about 700 words, and a limited number of verb endings and plural noun forms. I’ll build up the list as I solve more of the CONs from the ACA.

The drawback to using websites like this is that a copy-paste approach often requires deleting a huge amount of unwanted explanatory text, and may include lots of repetitions. Which is why I have a clean-up file.

#####################
# clean_wordlist.py #
#####################

def main():
    dictionary_name = 'path to raw text list.txt'
    cleaned_name = 'file to save to.txt'
    read_and_process_rawfile = True
    min_ngram_cnt = 200

    # If read_and_process_rawfile is True, just read the file named in
    # dictionary_name, then sort the words alphabetically and remove any
    # duplicated words.
    if read_and_process_rawfile:
        word_list = get_word_list(dictionary_name)
        read_and_sort(word_list, cleaned_name)

    # Otherwise, go through the cleaned word list and generate the ngram lists.
    else:
        word_list = get_word_list(cleaned_name)
        dict_big = {}
        for word in word_list:
            for n in range(2, 5):
                if len(word) >= n:
                    for i in range(len(word) - n + 1):
                        seg = word[i: i + n]
                        if seg in dict_big: dict_big[seg] += 1
                        else: dict_big[seg] = 1
        print(len(dict_big))

        # It's really easy for there to be 1,000 or 2,000 of just the 5-gram
        # letter groups, and that's way too many and will slow the ngram
        # count process way down. So, check the number of times each ngram
        # appeared in the word list, and only accept those above a
        # particular threshold.
        for n in range(2, 5):
            grams = []
            for seg in dict_big:
                if len(seg) == n and dict_big[seg] > min_ngram_cnt:
                    grams.append(seg)
            grams.sort()
            print(len(grams))
            print(grams)


def get_word_list(filename):
    """Read the specified text file and return the associated list object."""
    with open(filename, 'r', encoding='utf-8') as fo:
        word_list = fo.read().upper().split('\n')
    return word_list


def read_and_sort(listname, outname):
    """Add each word from the raw list to cleaned_list, avoiding repetitions
    and blank lines, then sort the list and save it to the new file."""
    cleaned_list = []
    for word in listname:
        if word and word not in cleaned_list: cleaned_list.append(word)

    cleaned_list.sort()

    with open(outname, 'w', encoding='utf-8') as fo:
        for word in cleaned_list:
            fo.write(word + '\n')


main()

How big is too big?
Ok, this question is kind of a matter of personal taste. I’ve seen a few people bragging about having three million or more words in their lists, but in some cases their files haven’t been filtered at all, and may contain nonsense strings like KKL and ZZZZZ. In my opinion, the bigger the file, the more time will be wasted on checking words that aren’t in the specific message you’re trying to solve. One file I received was 12 MB, and it took Notepad several minutes to open. It was built on patterns (i.e., aa, ab, aaa, aab, aba, abb, abc, etc.), and it would have to be parsed in Python to create the dict object first. Very slow.

I don’t need a “comprehensive” list. I just need the more common words and the ability to update the list easily when a solution turns out to have a word not already in the list. I’m currently debating whether to add that functionality to CONs Parser next, or to go straight back to Transpo Solver.

How to steal words from Wikipedia or Gutenberg.org?
Several people on the net have discussed the desirability of randomly sampling large numbers of online public domain resources, and building up wordlists that way. A few months ago, I experimented with writing a program for fetching a file from Gutenberg just to see what the process is.

There are two main issues with this approach. First, if you plan on randomly grabbing a bunch of e-books, you need to know how to get at them. This implies reading a table-of-contents file and snipping the URLs of all the books out of it. It’s not an insurmountable task, just kind of fiddly, given the HTML formatting you have to wade through.
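
For what it’s worth, a rough sketch of that URL-gathering step is just a regular expression over an index page. The URL below is only an example, and in practice you’d filter the resulting links down to the e-book pages you actually want:

import re
import urllib.request

index_url = 'https://www.gutenberg.org/ebooks/search/?query=esperanto'  # example only
html = urllib.request.urlopen(index_url).read().decode('utf-8')

# Grab every href attribute on the page, then whittle the list down later.
links = re.findall(r'href="([^"]+)"', html)
print(links[:20])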

The second issue is that there may be a lot of “chaff” in your target file that needs to be removed, such as HTML code, JavaScript, and other stuff you may not want. If you think about Wikipedia, you may need to weed out links to various article sections, and anything in the ToC links, footnotes and outside references sections. Additionally, if you’re working on other languages, you may find a lot of English slipping in as well. That’s a bit easier to deal with: just compare your xeno list to your English list and automatically remove anything obviously English.
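
That comparison can be a one-liner once both lists are loaded. A quick sketch, assuming english_words and xeno_words are already plain Python lists:

english_set = set(english_words)
xeno_only = [w for w in xeno_words if w not in english_set]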

The following is just a rough attempt to play with getting files from the net in Python. And yes, I know this is not compliant with the Python guidelines. As I say, this was just a rough attempt. I’ll come back to this later when I have more time.

import urllib.request


def parseText(t):
    # Walk through the raw HTML, skipping everything between '<' and '>'
    # so that only the visible text is kept.
    temp = ''
    ptr = 0
    skip = False
    while ptr < len(t):
        ch = t[ptr]
        if ch == '<' or ch == '>':
            skip = not skip
        elif not skip:
            temp += ch
        ptr += 1

    temp = removeChars(temp)

    # Split on spaces, upper-case each token, and keep only the unique,
    # purely alphabetic words.
    tempAry = temp.split(' ')
    ret = []
    for tok in tempAry:
        tok = tok.upper()
        if hasValidChar(tok):
            if tok not in ret and len(tok) > 0: ret.append(tok)

    return ret


def removeChars(t):
    temp = t.replace("'", '')

    for ch in '\n\t[]()=".!?':
        temp = temp.replace(ch, ' ')

    # Note, wordpress stupidly removes multiple spaces from the text.
    # To compensate for this here, I'm creating sp to hold the two spaces
    # in a var in order to remove them uniformly prior to using
    # .split(' ') to create the list later.
    sp = ' ' * 2
    while sp in temp:
        temp = temp.replace(sp, ' ')
    return temp


def hasValidChar(t):
    LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    for ch in t:
        if ch not in LETTERS: return False
    return True


dataLink = 'https://en.wikipedia.org/wiki/Scotch_whisky'
data = urllib.request.urlopen(dataLink)
x = parseText(data.read().decode('utf-8'))

print(x)

Next up: I don’t know.
