Python GUI 028 – Wordlist Updater, cons_parser_update_wordlists.py

And here’s the new file for updating the wordlists, treated as a utility module.

This file contains the following functions:

get_languages(records)
get_words_for_language(lang, records, current_list)
get_wordlist(fn)
save_wordlist(fn, wordlist)
get_ngrams(wl, ngram_cnt_max, ngram_no_max, language, substitute)
get_character_lists(language)
sort_first(val)
sort_second(val)
sort_third(val)

###################################
# cons_parser_update_wordlists.py #
###################################

"""
imports
Bring in the CON record structure and field enums.
Bring in the standard file handling via pathlib's Path.
"""

from cons_gui_utils import ClassConRecords, ConTitleFields
from pathlib import Path

"""
get_languages()

Read the CON data from the CON records object, and collect a list of all of the languages used.
"""

def get_languages(records):
... languages = []

... for record in records:
....... lang = record[1][ConTitleFields.language.value]

# Avoid entering a language more than once.

....... if len(lang) > 0 and lang not in languages:
........... languages.append(lang)

... return languages
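To see the dedup logic in isolation, here's a minimal stand-in sketch. The nested-list record shape and the LANG index are hypothetical fakes standing in for ClassConRecords and ConTitleFields:

```python
# Stand-in demo of the dedup logic in get_languages().
LANG = 0  # hypothetical stand-in for ConTitleFields.language.value

def get_languages(records):
    languages = []
    for record in records:
        lang = record[1][LANG]
        # Skip empties, and avoid entering a language more than once.
        if len(lang) > 0 and lang not in languages:
            languages.append(lang)
    return languages

fake_records = [
    ['A-1', ['English']],
    ['A-2', ['German']],
    ['A-3', ['English']],
    ['A-4', ['']],
]
print(get_languages(fake_records))  # ['English', 'German']
```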

"""
get_words_for_language()

Once the user has selected a language to update, go through each CON record again, looking only for CONs for that language that have been marked as "done."

As a philosophical decision, I'm also collecting all of the keywords for the solved CONs, and if they're not empty, adding them to the end of the solution text for processing. This is fine for English CONs, but is creating a distraction for me when xeno CONs have English keys.

Then, run through the solution text, removing punctuation and extra spaces.

For languages like German and French, run through the substitute dict object and replace each character special to that language with the English equivalent.

Split on space, and return the final processed list.
"""

def get_words_for_language(lang, records, current_list):
... sols_list = []

# For each CON, store the record data to variables for ease of use.

... for record in records:
....... l = record[1][ConTitleFields.language.value].lower()
....... d = record[1][ConTitleFields.done.value]
....... k = record[1][ConTitleFields.key.value].strip().upper()
....... klen = len(k)

# Process if we have the right language and Done == True

....... if lang == l and d:

# Append the solution text for this CON to sols_list

........... sols_list.append(record[3].upper().strip())

# If the key field is not empty, append the key word(s) as well.

........... if klen > 0:
.............. sols_list.append(k)

# Join sols_list into one long, space-separated string, sol_mass.

... sol_mass = ' '.join(sols_list)

# For this language, get the clean-up strings.
# nosp - Characters that are simply removed.
# tosp - Characters that are replaced by spaces.
# legal - Characters allowed to remain in the string.
# substitute - xeno characters plus their replacements.

... nosp, tosp, legal, substitute = get_character_lists(lang)

# Remove apostrophes outright, so contractions collapse into single words.

... for punct in nosp:
....... sol_mass = sol_mass.replace(punct, '')

# Other characters, replace with a space.

... for punct in tosp:
....... sol_mass = sol_mass.replace(punct, ' ')

# Print out any remaining characters not in the legal list.
# In the future, I may turn this into a showinfo() box.

... for ch in sol_mass:
....... if ch not in legal:
........... print('%s: Character not recognized: |%s|' % (lang, ch))

# Remove all double spaces.

... while ' '*2 in sol_mass:
....... sol_mass = sol_mass.replace(' '*2, ' ')

"""
Note that substitute doesn't get used here. I'm returning it to the calling function. If the user decides to update the n-grams lists, we'll use substitute then.

Turn the string into a list, split on space.
Then, only keep the new words not already in the existing wordlist, and sort them alphabetically.
"""

... short_list = []
... words_list = sol_mass.split()

... for word in words_list:
....... if word not in current_list and word not in short_list:
........... short_list.append(word)
... short_list.sort()

# Return the finished new words list, and the substitute dict object.

... return short_list, substitute
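Here's the whole scrubbing pipeline on a toy string, end to end. The nosp and tosp strings here are abbreviated stand-ins for what get_character_lists() actually returns:

```python
# Toy walk-through of the cleanup in get_words_for_language(),
# using abbreviated stand-ins for the nosp/tosp strings.
nosp = "'"            # characters removed outright
tosp = ',.!?'         # characters replaced by a space

sol_mass = "DON'T PANIC, IT'S FINE. REALLY!"
for punct in nosp:
    sol_mass = sol_mass.replace(punct, '')
for punct in tosp:
    sol_mass = sol_mass.replace(punct, ' ')
while '  ' in sol_mass:
    sol_mass = sol_mass.replace('  ', ' ')

# Keep only words not already in the existing wordlist, then sort.
current_list = ['FINE']
short_list = []
for word in sol_mass.split():
    if word not in current_list and word not in short_list:
        short_list.append(word)
short_list.sort()
print(short_list)  # ['DONT', 'ITS', 'PANIC', 'REALLY']
```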

"""
get_wordlist()

Read the specified file and return its contents, uppercased and split on newlines.

Note that we're actually using get_wordlist() to read the n-grams file, as well as the wordlist file.

I may consider moving get_wordlist() and save_wordlist() to a util file, like crypt_utils.py, at a later date.
"""

def get_wordlist(fn):
... path = Path(fn)
... ret = []

... if path.is_file():

"""
File exists. Read it and return the contents.
We need to use utf8 for character support for non-English languages.
"""

....... with open(fn, 'r', encoding='utf8') as file:
........... ret = file.read().upper().split('\n')

... return ret

"""
save_wordlist()

Quick little default function for saving data to the specified file in utf8 format.

Take the list object we receive and join it with newline characters. Again, save_wordlist() is also used to save the n-gram data.
"""

def save_wordlist(fn, wordlist):
... with open(fn, 'w', encoding='utf8') as file:
....... file.write('\n'.join(wordlist))
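Since the two functions are mirror images (newline-joined utf8 text out, uppercased newline-split text back in), a quick round trip through a temp file makes a good sanity check. This sketch inlines both functions so it runs standalone; the temp path is just for the demo:

```python
import os
import tempfile
from pathlib import Path

def save_wordlist(fn, wordlist):
    # Join on newlines and write as utf8, mirroring the module version.
    with open(fn, 'w', encoding='utf8') as file:
        file.write('\n'.join(wordlist))

def get_wordlist(fn):
    # Read back uppercased and split on newlines; missing file -> [].
    ret = []
    if Path(fn).is_file():
        with open(fn, 'r', encoding='utf8') as file:
            ret = file.read().upper().split('\n')
    return ret

fn = os.path.join(tempfile.mkdtemp(), 'words.txt')
save_wordlist(fn, ['apfel', 'straße', 'zug'])
print(get_wordlist(fn))  # ['APFEL', 'STRASSE', 'ZUG']
os.remove(fn)
```

Note that utf8 matters here: without it, the xeno characters in languages like German would be mangled on some platforms' default encodings.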

"""
get_ngrams()

Given the wordlist for the current language, the language itself, the desired n-gram count and total settings, and the substitute dict for this language, determine the current n-grams and return the top N to the caller.

substitute was provided by get_words_for_language().

Return the individual 2-, 3- and 4-gram lists as space-delimited strings, one list entry per n-gram width; save_wordlist() joins the entries with newlines when the caller writes them out.

i.e.
['er re ed', 'ist ion age', 'tion tute']
"""

def get_ngrams(wl, ngram_cnt_max, ngram_no_max, language, substitute):
... ret = []
... wordlist = wl

"""
Replace the special xeno letters, if they exist, with their English equivalents.
For French, remove the apostrophe (we don't want it to appear in the n-grams).
Do all of this as a string, which is easier than trying to do it with a list object.
"""

... if len(substitute) > 0 or language.upper() == 'FRENCH':
....... wordlist = ' '.join(wordlist)
....... if language.upper() == 'FRENCH':
........... wordlist = wordlist.replace("'", "")

....... for sub in substitute:
........... if sub in wordlist:
............... wordlist = wordlist.replace(sub, substitute[sub])

# Go back to working with a list object.

....... wordlist = wordlist.split(' ')

# For each n-gram width: i runs from 0 to ngram_cnt_max - 2, and the width is i + 2.

... for i in range(ngram_cnt_max - 1):
....... gram_len = i + 2
....... d = {}

# Take each word in the list

....... for word in wordlist:

# If the word is longer than the gram width...

........... if len(word) >= gram_len:

# Run through the word in gram-width segments. Note the + 1:
# without it, a word exactly gram_len long would yield no grams at all.

............... for ptr in range(len(word) - gram_len + 1):

# Pull out the segment.

................... gram = word[ptr:ptr + gram_len]

# If the segment is not in the dict, add with cnt 1.
# Otherwise, increment the cnt for that segment.

................... if gram in d: d[gram] += 1
................... else: d[gram] = 1

# We're done collecting the segments and their counts.

....... temp_list = []
....... short_list = []

# Copy the data from the dict object to a list for sorting.

....... for gram in d:
........... temp_list.append([gram, d[gram]])

"""
I'm using sort_second() (below) to set up the conditions for sorting on the ngram counts. Sort on reverse count (largest to smallest).
"""

....... temp_list.sort(key = sort_second, reverse = True)

"""
This next part is a bit tricky. While English will have LOTS of ngrams for each width, many of the other languages may only have 60 or 90. So, if the user says "get the top 200" and we only have 90, we're going to run off the end of the list. Set smallest_max to the smaller of "user preference" and "total ngram count".
"""

....... smallest_max = ngram_no_max
....... if len(temp_list) < smallest_max:
........... smallest_max = len(temp_list) # Could use min() I guess.

....... for j in range(smallest_max):
........... short_list.append(temp_list[j][0])

# Append the list to ret as a space-separated string.

....... ret.append(' '.join(short_list))

# Return the finished list.

... return ret
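For comparison (this is not how the module does it), the count-and-sort loop can be cross-checked against the standard library: collections.Counter.most_common() handles both the reverse count sort and the "fewer grams than requested" case in one call. The helper name top_ngrams here is hypothetical; note the + 1 in the range, which keeps a gram from a word that is exactly gram_len long:

```python
from collections import Counter

def top_ngrams(wordlist, gram_len, no_max):
    # Count every gram_len-wide segment of every long-enough word.
    d = Counter()
    for word in wordlist:
        if len(word) >= gram_len:
            # + 1 so a word exactly gram_len long still yields one gram.
            for ptr in range(len(word) - gram_len + 1):
                d[word[ptr:ptr + gram_len]] += 1
    # most_common() sorts largest count first, and asking for more
    # entries than exist just returns what's there.
    return [gram for gram, cnt in d.most_common(no_max)]

words = ['THE', 'THEN', 'THIS']
print(top_ngrams(words, 2, 2))  # ['TH', 'HE']
```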

"""
get_character_lists()

For the given language, return the "no space," "to space," and "legal" strings, and the "substitute" dict object.
"""

def get_character_lists(language):
... lang = language.lower()

# Defaults. Every language shares the same "to space" string,
# so set tosp once here rather than repeating it in every branch.

... nosp = "'’"
... tosp = '\n#,.!?¿*+"();:--—“”/1234567890$‘=…_'
... legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ '
... sub = {}

... if lang == 'english':
....... pass

... elif lang == 'french':
....... nosp = "’"

... elif lang == 'german':
....... sub = {'Ä':'A', 'Ö':'O', 'Ü':'U'}

... elif lang == 'spanish':
....... legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÍ '
....... sub = {'Í':'I', 'Ñ':'N'}

... elif lang == 'italian':
....... pass

... elif lang == 'latin':
....... pass

... elif lang == 'esperanto':
....... pass

... elif lang == 'afrikaans':
....... pass

... elif lang == 'dutch':
....... pass

... elif lang == 'norwegian': # *wxz
....... legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÅØ '

... elif lang == 'swedish': # *qwz
....... legal = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ '

... elif lang == 'portuguese':
....... pass

... else:

# May want to turn this print statement into a
# showinfo() box.

....... print('%s not recognized in cons_parser_update_wordlists.py' % (lang))

... return nosp, tosp, legal, sub

"""
By default, Python's .sort() compares whole elements, so a list of [ngram, count] pairs would sort on the ngram text. However, it is possible to specify a more complex sort in the .sort() statement by passing a key function.

var.sort([key=method], [reverse=True])

I'm using this format in get_ngrams() above.

Each temp_list entry is in the form [ngram_segment, count], so to sort on count in reverse order, I'm using:

temp_list.sort(key = sort_second, reverse = True)

Originally, I had all three of the below key methods in my hillclimber_utils.py module, but I think I'm going to move them into crypt_utils.py in the foreseeable future.
"""

def sort_first(val):
... return val[0]

def sort_second(val):
... return val[1]

def sort_third(val):
... return val[2]
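For what it's worth, sort_second() is equivalent to operator.itemgetter(1) or a one-line lambda; all three produce the same ordering on ngram-style pairs:

```python
from operator import itemgetter

def sort_second(val):
    return val[1]

temp_list = [['HE', 30], ['TH', 50], ['ER', 20]]

# All three of these produce the same ordering: largest count first.
a = sorted(temp_list, key=sort_second, reverse=True)
b = sorted(temp_list, key=itemgetter(1), reverse=True)
c = sorted(temp_list, key=lambda val: val[1], reverse=True)
print(a)            # [['TH', 50], ['HE', 30], ['ER', 20]]
print(a == b == c)  # True
```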

Next up: I have no idea.

Published by The Chief
