Python GUI 029 – Crappy Data Scraper

There are a number of YouTube videos describing how to do data scraping on Amazon and Google. I haven’t been able to make their automatic approach work, so I came up with a half-assed version. The normal approach is to use the third-party requests module in combination with something called BeautifulSoup.

When I tried using requests, and the Python built-in urllib.request module, I got errors that an email.py file or something can’t be found in my work directory. I’m not interested in doing a lot of detective work on this project, so I’ll just keep moving on. Next, I don’t know anything about BeautifulSoup at all, other than it’s a third-party package designed to parse HTML code for you. I don’t want to clutter up my computer with lots of different Python packages, so I didn’t try downloading and installing the bs4 package BeautifulSoup is found in.

In other words, there are good methods (I’m told) for getting information from website source files. I’m not using them.

(Scraper GUI)

My approach is to use tkinter to create a GUI with a message bar, a source code paste area, two buttons and a menu bar with just a File-Exit option. I open a browser tab, go to Amazon.com, and search on “escape room game.” When I get the first page of the results, I open View Page Source and wait a second for the full source page to load. Then I select all and copy, switch to my GUI, and paste the clipboard contents into the source code paste area and follow this by clicking the Process button. After about 1-2 seconds, the paste area will clear, and the message bar will display “Amazon Page 2” and the number of game links found. From this point I can go to the next page of search results and repeat the process, or change the search string to something like “escape game unlock!” Regardless, when I’m done, I click the Finish button to output my results to an .html file for display and checking.

Ok, so why am I doing this?

One of my other activities for the Black Chamber is checking table top escape room games (such as from Exit, Unlock! and Escape Room the Game) for ciphers. I buy the games, play them, then review them. After close to 2 years and over 150 games, I’m pretty much caught up with the existing games on the market (that I’m interested in), and I’m checking whether anything new is about to come out, or if I’ve missed something that just came out. I need a “This company has a new product release” alert, and that’s not something Amazon supports, from what I can tell.

So, my plan was to write a data scraper, and then check Amazon for new game releases about once a week. Manually looking at all of the search result pages myself takes at least 10-15 minutes, and I’m hoping to streamline things a bit this way. Additionally, if I have hot-linked images to the new games on a separate HTML page, I can click on the image of the game, go to the right place on Amazon and add the game to my wish list for later purchase if the release date is in the future.

To further streamline things, I’ve added two textfiles – ignore.txt and havegame.txt. The program compares all of the games to the strings in ignore.txt (usually names of companies that make games I don’t want, like Talking Tables, or have descriptions of things I don’t want like “3D wooden puzzle”) to blanket skip any game that has a match. Conversely, I also check the game description against the strings in havegame.txt to screen out games I do have from companies I want more games from. The ultimate goal is to just get reports of a smaller number of games that are new that I can quickly glance through.
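The two-list screening described above can be sketched on its own, away from the GUI. This is a minimal illustration, not the program's actual code: the helper names parse_have_list() and wanted() and the sample strings are made up for the example, and the only assumption carried over is the "company name~game title" line format.

```python
def parse_have_list(raw):
    """Split 'company~title' lines into [company, title] pairs, skipping blanks."""
    pairs = []
    for line in raw.upper().split('\n'):
        if line.strip():
            pairs.append([part.strip() for part in line.split('~')])
    return pairs


def wanted(description, ignore, have):
    """Return True if the description passes both filters."""
    desc = description.upper()
    # Blanket skip: any non-empty ignore string appearing in the description.
    # (An empty string is "in" every string, so blanks must be skipped.)
    if any(term and term in desc for term in ignore):
        return False
    # Already owned: every part of a company~title pair appears.
    if any(all(part in desc for part in pair) for pair in have):
        return False
    return True


# Hypothetical sample data, not taken from the real text files.
ignore = ['TALKING TABLES', '3D WOODEN PUZZLE']
have = parse_have_list('Exit~The Abandoned Cabin\n\nUnlock!~Heroic Adventures')

samples = [
    'EXIT: The Abandoned Cabin | Escape Room Game',
    'Talking Tables Escape Room Party Game',
    'Unlock! Some Future Box',
]
keep = [s for s in samples if wanted(s, ignore, have)]
print(keep)
```

Only the third sample survives: the first matches a company~title pair, and the second matches an ignore string.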

The reality is that the Amazon search results are a mess, with “recommendations” that aren’t escape games, and escape games that don’t get included in the search results for “reasons.” One case in point is the new Unlock! Extraordinary Adventures pack that was scheduled for a Jan. 12, 2024 release, then disappeared from the search results, and is now marked “We don’t know when or if this item will be back in stock.”

On top of this, there seems to be more than one way the result data can be formatted in the HTML. I’m highly tempted to try using BeautifulSoup anyway, in the hopes that the developers know what they’re doing more than I do. My code is a hack, and I’m not proud of it.
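For anyone who wants a real parser without installing anything, the standard library's html.parser module is one option. The sketch below is not the code used in this post. It only shows the nesting idea of tracking div depth and closing a product block when its own div closes. The CardParser class name and the sample HTML are invented for the example, and real Amazon markup is far messier.

```python
from html.parser import HTMLParser


class CardParser(HTMLParser):
    """Collect the first image and all text inside each data-asin <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # current <div> nesting depth
        self.card_depth = None  # depth at which the current card opened
        self.card = None        # data for the card being read, if any
        self.cards = []         # finished cards

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            self.depth += 1
            # Start a card only for a non-empty data-asin attribute.
            if self.card is None and attrs.get('data-asin'):
                self.card_depth = self.depth
                self.card = {'asin': attrs['data-asin'],
                             'img': None, 'text': []}
        elif tag == 'img' and self.card is not None:
            if self.card['img'] is None:
                self.card['img'] = attrs.get('src')

    def handle_endtag(self, tag):
        if tag != 'div':
            return
        # The card ends when the div that opened it closes.
        if self.card is not None and self.depth == self.card_depth:
            self.cards.append(self.card)
            self.card = None
        self.depth -= 1

    def handle_data(self, data):
        if self.card is not None and data.strip():
            self.card['text'].append(data.strip())


# Made-up page fragment for illustration only.
sample = (
    '<div class="results">'
    '<div data-asin="B000TEST01">'
    '<div><img src="https://example.com/a.jpg"></div>'
    '<span>Sample Game &amp; Expansion</span><span>$14.99</span></div>'
    '<div data-asin=""><span>spacer</span></div>'
    '</div>'
)
parser = CardParser()
parser.feed(sample)
parser.close()
print(parser.cards)
```

The parser decodes entities such as &amp;amp; on its own, so the title text comes back already cleaned up.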

But, my code does mostly work, and I’m going to keep using it for now. Note that this code comes with no guarantees or warranties. Your mileage will vary.

###################
# data_scraper.py #
###################

# Imports

import tkinter as tk
from tkinter import Menu, ttk
from tkinter.messagebox import showinfo
from pathlib import Path

"""
global setup

disp: Flag for printing HTML tags
html_*: Constants for formatting the HTML results output
html_tab: Junk used to indent HTML body for readability
htmlfilename: Name of the results output file
"""

disp = False
html_header = '<!DOCTYPE html>\n<html>\n<head>\n<title>New Escape Game Check Results</title>\n</head>\n<body>'
html_footer = '</body>\n</html>'
html_block = '\n<p>\nZZ<a href="https://www.AAA">\n<img src="BBB" width="200">\n</a>\n<br />ZZCCC<br />\nZZDDD\n</p>\n'
html_tab = '…..'
htmlfilename = 'new_games_check_list.html'

"""
msg_prompt: Text to display on information line
msg_number: Page number being processed
"""

msg_prompt = 'Amazon Page '
msg_number = 1

# Create the global data class

class MainData:

... def __init__(self):

# Get the textfile that contains the names of the games I already
# have. The titles are in the format:
# company name~game title
#
# Read the file, split on newline, ignore blank lines, then split
# the company and game title on "~" to append as a list.

....... havetemp = \
........... get_filetext('amazon_havegame.txt').upper().split('\n')
....... haveproc = []
....... for game in havetemp:
........... if len(game.strip()) > 0:
............... game_name = game.split('~')
............... haveproc.append(game_name)

# ignore: The list of game title items to delete from the results
# list. Blank lines are dropped, because an empty string
# would match every title.
# havegame: The processed list created above.
# gameslist: Holder for the games we scrape from the page source.

....... self.ignore = [
........... item.strip() for item in
........... get_filetext('amazon_ignore.txt').upper().split('\n')
........... if item.strip()]
....... self.havegame = haveproc
....... self.gameslist = []

# The main program as a class

class DataScraper(tk.Tk):

# Initialize the pointers to the tkinter screen widgets

... entry_box_list = {}

# Method for making the menu bar
# Only one item: File-Exit

... def make_menu(root):
....... menubar = Menu(root)
....... root.config(menu=menubar)

....... # Create the main menu bar

....... file_menu = Menu(menubar, tearoff=0)
....... file_menu.add_command(label='Exit', command=root.destroy)
....... menubar.add_cascade(label='File',
........... menu=file_menu, underline=0)

# Class initializations

... def __init__(self, *args, **kwargs):
....... global maindata

# Tkinter Window definitions

....... tk.Tk.__init__(self, *args, **kwargs)
....... tk.Tk.wm_title(self, "Data Scraper")
....... self.geometry('1400x700')
....... self.resizable(0,0)

# Set the frame column ratios

....... self.columnconfigure(0, weight=4)
....... self.columnconfigure(1, weight=1)

# Call function to create the message frame

....... msg_frame, self.entry_box_list = create_msg_frame(self)
....... msg_frame.grid(column=0, row=0, columnspan=2)

# Call function to create the page source data frame

....... entry_frame, self.entry_box_list = \
........... create_entry_frame(self, self.entry_box_list)
....... entry_frame.grid(column=0, row=1)

# Call function to create the buttons frame

....... button_frame, self.entry_box_list = \
........... create_button_frame(self, self.entry_box_list)
....... button_frame.grid(column=1, row=1)

# Call the function for making the menu bar

....... self.make_menu()

# Load the global data object

....... maindata = MainData()

""" create_msg_frame()

'msg_line' is the dict item for referencing the information display entry widget. entries{} is created here and used in all other functions for passing the widget pointers.

Return the pointer to the frame, and the dict object containing the widget pointers. """

def create_msg_frame(root):

... frame = ttk.Frame(root)

... entries = {}

... frame.columnconfigure(0, weight=1)

... msg_text = tk.StringVar()
... entry_text = ttk.Entry(frame, width=80, textvariable=msg_text)
... entry_text.grid(column=0, row=0, sticky=tk.W, padx=5, pady=5)
... msg_text.set('%s %s' % (msg_prompt, msg_number))
... entries['msg_line'] = msg_text

... return frame, entries

""" create_entry_frame()

'entry_area' is the dict item for referencing the page source entry widget. We want to create a scrollable text widget, plus vertical scrollbar.

Return the pointer to the frame, and the dict object containing the widget pointers.

"""

def create_entry_frame(root, ent):

... frame = ttk.Frame(root)

... entries = ent

... frame.columnconfigure(0, weight=1)

... text_area = tk.Text(frame, width=120, height=40)
... text_area.grid(column=0, row=0, sticky=tk.EW, padx=5, pady=5)
... text_area.configure()
... entries['entry_area'] = text_area

... scrollbar = ttk.Scrollbar(frame, orient=tk.VERTICAL)
... scrollbar.grid(row=0, column=1, sticky=tk.NS)

... text_area['yscrollcommand'] = scrollbar.set
... scrollbar['command'] = text_area.yview

... return frame, entries

""" create_button_frame()

We don't really need to do anything for setting or disabling the Process or Finish buttons, but 'process' and 'finish' are added to the entries dict object just in case. button_dummy is just used as an invisible spacer between the other two buttons.

Return the pointer to the frame, and the dict object containing the widget pointers """

def create_button_frame(root, ent):

... frame = ttk.Frame(root)

... entries = ent

... frame.columnconfigure(0, weight=1)

... button_proc = ttk.Button(frame, text='Process',
....... command=lambda: process_page(entries))
... button_dummy = tk.Button(frame, bg='gray94',
....... relief='flat')
... button_finish = ttk.Button(frame, text='Finish',
....... command=lambda: export_html())

... for widget in frame.winfo_children():
....... widget.grid(padx=5, pady=5)

... entries['process'] = button_proc
... entries['finish'] = button_finish

... return frame, entries

""" process_page()

This is the function that executes when the user clicks "Process" on the main window. We're going to take the Amazon results page html source from the text widget on the main window, and first check whether we have the complete HTML source. We do this by looking for the line containing the string "(MEOW)". If we don't have that, display an error in the information line at the top of the window. Otherwise, increment the page counter, and process the page. """

def process_page(entry):
... global msg_number, maindata

# Copy the contents of the entry widget

... amazon_page = entry['entry_area'].get('1.0', 'end')

# Check for completeness

... if '(MEOW)' not in amazon_page:
....... entry['msg_line'].set(
........... 'ERROR -- Incomplete page -- Not processed')
....... return

# Increment page counter

... msg_number += 1

# Process the HTML source and add the new games
# found to the gameslist member in the maindata global object

... maindata.gameslist, new_cnt = \
....... get_game_list(amazon_page, maindata.gameslist)

# Display the new page number in the information line

... entry['msg_line'].set('%s %s [Added %s games]'
....... % (msg_prompt, msg_number, new_cnt))

# Clear the entry widget in prep for more html page source

... entry['entry_area'].delete('1.0', 'end')

""" export_html()

Save the desired game data as a crudely-formatted HTML page. """

def export_html():
... global maindata

# Load the global object data to local variables for speed.

... ignore = maindata.ignore
... havegame = maindata.havegame
... game_list = maindata.gameslist

# Prep the html file with the header section.
# total is used to track the number of games saved to the html file.

... html_build = html_header
... total = 0

# For each game_list item:

... for elem in game_list:

# State we want this game by default

....... showgame = True
....... for ignore_game in ignore:

# Go through the ignore list and check whether the
# current game title contains something we want to
# ignore. Skip blank entries, since an empty string
# matches every title.

........... if ignore_game and ignore_game in elem[0].upper():
............... showgame = False

....... if showgame:

# Not going to ignore this game. Now, check
# whether we already have this game

........... have = False
........... for game in havegame:

# If both the company name and the game title are in
# the game description (elem[0]), we have this game.
# Ignore it.

............... if (game[0] in elem[0].upper() and
................... game[1] in elem[0].upper()):
................... have = True

# If we don't have this game, increment the total
# counter, then insert the game text into the html template

........... if not have:
............... total += 1

# Preformat the html block with fake tabs

............... html_component = \
................... html_block.replace('Z', html_tab)

# Break up the game description line into segments
# of 80 characters or less. Also insert the fake
# tab characters at the beginning of each line.
# Only the 'ZZ' markers added by change_length() are
# replaced, so a capital Z in a title is left alone.

............... gametitle = change_length(elem[0]).replace(
................... '<br />ZZ', '<br />' + html_tab * 2)

# Load the game URL, image, description and price
# into the template

............... html_component = html_component.replace('AAA', elem[1])
............... html_component = html_component.replace('BBB', elem[2])
............... html_component = html_component.replace('CCC', gametitle)
............... html_component = html_component.replace('DDD', elem[3])

# Add our completed template to the html body block.

............... html_build += html_component

# We've added all the desired games.
# Add the footer to complete the output html file.

... html_build += html_footer

# Write the finished html output file as given in html_build.

... write_html(htmlfilename, html_build)

# Display an info box to let the user know we actually did work.

... showinfo('New Game Updater',
....... 'Saved to %s.' % (htmlfilename))

""" change_length()

Break up the game description into lines no longer than 80 characters each. Start by setting ptr to 80, then decrement until we find a blank character. If we get to zero, assume that we have an unbreakable line and split it automatically at 80 characters. Add the fake tab template characters (ZZ) at the beginning of each line. """

def change_length(text):
... ret = ''

# If the line is already under 80 characters, return it

... if len(text) <= 80:
....... return text
... else:

# Repeat splitting (inserting newline) until done

....... while len(text) > 80:
........... ptr = 80
........... while text[ptr] != ' ' and ptr > 0:
............... ptr -= 1
........... if ptr == 0:

# No blank found: hard split at 80 and keep every character

............... ret += text[:80] + '<br />ZZ'
............... text = text[80:]
........... else:

# Split at the blank and drop it

............... ret += text[:ptr] + '<br />ZZ'
............... text = text[ptr + 1:]

# Whatever is left is the dangling tail. Add all of it to ret

....... ret += text

... return ret

"""
get_game_list()

Ok, here's where we do the heavy lifting. The page source html is broken up into <script> sections and <div > sections. We can simply cut out the <script></script> sections because they don't contain the game data. The <div > sections will contain <img >, <span>, <a href=> sections and so on. I wanted to analyze each section with the nesting intact, so I set up a nest counter (spacer). Spacer is incremented with each starting tag (i.e. - <div >) and decremented with each matching closer (i.e. - </div>).

The data block for each game (product description) starts with <div 'data-asin='>. Store the spacer count for this nesting level. When we decrement with </div> and get to the stored spacer count value again (that is, if "<div 'data-asin='>" is at spacer==6, then we want to identify the "</div>" tag that causes spacer to equal 6 again), take the game data we had and append it as a list to game_list[].

When the length of the variable "text" reaches 0, we've hit the end of the page source html. Return game_list[], and the number of new games added to it. Note that it's not really easy detecting tags that can be uniquely identified for specific desired values. In a few cases, all I can go by is a formatting string for the title, or the word "price" in a <span > tag, and it turns out that often those identifiers have more than one match. It's easier (and sloppier) to just make those variables lists and append the multiple values as they show up. In all cases so far, the first value in the list has been the right one, so I'll just use something like "cover_image = img_list[0]" during the assignments.

The exception is for price. If there are multiple values in the list, one will be the regular price, and the other will be the current discounted price. In this situation, I'll join them as "price = '/'.join(price_list)".
"""

def get_game_list(text, game_list):

# The section of the html page source we want starts at the below
# point. Find this point and remove the previous part before we begin
# processing.

... ptr1 = text.find('')
... text = text[ptr1+len(''):].strip()

# Initialize the function variables.

... spacer = 0
... done = False
... div_cntr = 0
... div_hit = -1
... tag_length = 90
... cover_image = []
... game_title = ''
... game_url = []
... game_price = []
... old_cnt = len(game_list)

... while not done:

# In the loop, we start by looking for the string segment that starts
# with '<' and ends with '>'. Then, return everything to the left of
# '<' as the lead(er), everything between '<' and '>' as the tag, and
# what's to the right of '>' as the remaining text.

....... lead, tag, text = get_next_tag(text)

# Check if text is length 0. If so, return the game_list, and
# the number of games we added to it this time.

....... if len(text.strip()) == 0:
........... return game_list, len(game_list) - old_cnt

# Start processing each of the tags.
# In some cases, all we're doing is changing indentation.

....... if '<div ' in tag:

# Track indentation, and the number of open <div > tags.

........... div_cntr += 1
........... if disp: print('%s' % ((spacer * ' ') + tag[:tag_length]))
........... spacer += 1

........... if 'data-asin=' in tag:

# We've hit the beginning of a new product description block.
# Store the div counter, and initialize the game vars.

............... div_hit = div_cntr
............... cover_image = []
............... game_title = ''
............... game_url = []
............... game_price = []
....... elif '</div>' in tag:

# Decrement the div counter and check whether
# we've reached the end of the game data.

........... spacer -= 1
........... div_cntr -= 1
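# div_hit was stored after the opening <div> was counted, so the
# matching </div> is the one that drops the counter to div_hit - 1.
# Also make sure we actually found a URL and an image.

........... if (div_hit > -1 and div_cntr == div_hit - 1
............... and len(game_title) > 0
............... and game_url and cover_image):

# We've hit the end of the game data.
# Save the current game var data to game_list.

............... game_list.append(
................... [game_title, game_url[0],
................... cover_image[0],
................... '/'.join(game_price)])

............... div_hit = -1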

........... if disp: print('%s' % ((spacer * ' ') + tag))

# With <script>, just find the matching </script>
# tag and delete everything in between. Stop if we run
# out of text, so an unclosed <script> can't loop forever.

....... elif '<script' in tag:
........... while tag != '</script>' and len(text) > 0:
............... lead, tag, text = get_next_tag(text)

# <span> can include price and game title information.

....... elif '<span' in tag:
........... if disp: print('%s' % ((spacer * ' ') + tag[:tag_length]))
........... spacer += 1

........... if 'class="a-offscreen"' in tag:

# What follows the tag is price data.

............... ptr1 = text.find('<')
............... game_price.append(text[:ptr1].strip())

........... if 'span class="a-size-base-plus a-color-base a-text-normal' in tag:

# We have game title data. Isolate it.
# Replace HTML entities with their ASCII characters.

............... ptr1 = text.find('</span')
............... holder = (text[:ptr1]
................... .replace('&#x27;', "'")
................... .replace('&amp;', '&'))

# Don't save the isolated title data if it says
# "Click to see the price."

............... if 'CLICK TO SEE' not in holder.upper():
................... game_title = holder

....... elif '</span>' in tag:
........... spacer -= 1
........... if disp: print('%s' % ((spacer * ' ') + tag))

....... elif '<a ' in tag:

# We have an anchor tag. Do the indenting.

........... if disp: print('%s' % ((spacer * ' ') + tag[:tag_length + 100]))

........... if 'href=' in tag and len(game_url) == 0:

# There are two kinds of href="" strings.
# The first one contains "/sspa/click?" and
# requires additional processing.

............... if 'href="/sspa/click?' in tag:
................... ptr1 = tag.find('url=') + len('url=')
................... ptr2 = tag[ptr1:].find('%3F')
................... url = 'amazon.com' + tag[ptr1: ptr1 + ptr2]
................... url = url.replace('%2F', '/')
................... url = url.replace('%3D', '=')
................... game_url.append(url)

# The second kind just requires some trimming.

............... else:
................... ptr1 = tag.find('href="') + len('href="')
................... ptr2 = tag[ptr1:].find('?')
................... game_url.append('amazon.com' +
................................... tag[ptr1:ptr1 + ptr2])

........... spacer += 1

# Everything else just controls indenting

....... elif '</a>' in tag:
........... spacer -= 1
........... if disp: print('%s' % ((spacer * ' ') + tag))
....... elif '<i ' in tag:
........... if disp: print('%s' % ((spacer * ' ') + tag[:tag_length]))
........... spacer += 1
....... elif '</i>' in tag:
........... spacer -= 1
........... if disp: print('%s' % ((spacer * ' ') + tag))
....... elif '<h1 ' in tag:
........... if disp: print('%s' % ((spacer * ' ') + tag[:tag_length]))
........... spacer += 1
....... elif '</h1>' in tag:
........... spacer -= 1
........... if disp: print('%s' % ((spacer * ' ') + tag))
....... elif '<h2 ' in tag:
........... if disp: print('%s' % ((spacer * ' ') + tag[:tag_length]))
........... spacer += 1
....... elif '</h2>' in tag:
........... spacer -= 1
........... if disp: print('%s' % ((spacer * ' ') + tag))

....... elif 'img ' in tag:

# We've hit the images section. If we have "src=", isolate the
# image URL and append it to our cover_image list. As mentioned
# above, there may be multiple images, but we'll only use the
# first one in the list.

........... if disp: print('%s' % ((spacer * ' ') + tag[:tag_length]))
........... if ' src=' in tag:
............... ptr1 = tag.find(' src=') + len(' src=')
............... ptr2 = tag[ptr1 + 1:].find('"')
............... cover_image.append(tag[ptr1 + 1: ptr1 + ptr2 + 1])

....... else:

# We've found an unexpected tag. Print it out if desired.

........... if disp: print('|%s|' % (tag[0:95]))

""" get_filetext()

This is just a file reader, for obtaining the ignore.txt and havegame.txt contents. """

def get_filetext(fn):
... path = Path(fn)
... ret = ''

# The with block closes the file for us.

... if path.is_file():
....... with open(fn, 'r', encoding='utf8') as file:
........... ret = file.read()

... return ret.strip()

""" write_html()

Take the finished, processed html containing our new game results and save it with an .html extension. """

def write_html(fn, outtext):

# The with block closes the file for us.

... with open(fn, 'w', encoding='utf8') as file:
....... file.write(outtext)

""" get_next_tag()

Take our current page source text and find the positions of the first '<' and the first '>'. Return everything to the left of '<' as leader. Return everything to the right of '>' as text. Return everything else as our tag. """

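def get_next_tag(text):
... ptr1 = text.find('<')
... ptr2 = text.find('>') + 1

# No further tag: hand back the remainder as the leader and an
# empty tail, so the caller sees the end of the page source.

... if ptr1 == -1 or ptr2 == 0:
....... return text.strip(), '', ''

... leader = text[0:ptr1]
... tag = text[ptr1:ptr2]
... tail = text[ptr2:].strip()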

... return leader.strip(), tag, tail

# End of program. Run app.

app = DataScraper()
app.mainloop()
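A few of the hand-rolled steps in the listing (entity replacement, URL percent-decoding, and the 80-character line breaking) have standard library equivalents. The sketch below shows them side by side. It is not a drop-in replacement for the listing: the helper names clean_title(), decode_sponsored_url() and wrap_title() and the sample strings are invented for the example, and it assumes that cutting everything after the first "?" is the wanted behavior for links.

```python
import html
import textwrap
from urllib.parse import unquote


def clean_title(raw):
    """Decode HTML entities such as &#x27; and &amp; in one call."""
    return html.unescape(raw)


def decode_sponsored_url(encoded):
    """Percent-decode a URL fragment and drop anything after the first '?'."""
    return unquote(encoded).split('?', 1)[0]


def wrap_title(text, width=80, joiner='<br />'):
    """Break text into lines of at most `width` characters.

    textwrap.wrap() splits on whitespace and, by default, hard-splits
    any single word that is longer than the width.
    """
    return joiner.join(textwrap.wrap(text, width=width))


# Made-up inputs for illustration only.
title = clean_title('Don&#x27;t Panic &amp; Escape')
link = decode_sponsored_url('%2FSample-Game%2Fdp%2FB000TEST01%3Fref%3Dxyz')
wrapped = wrap_title('word ' * 30, width=40)

print(title)
print(link)
print(wrapped)
```

html.unescape() covers every named and numeric entity, not just the two handled in the listing, which matters if a title ever contains a quote mark or an accented letter.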

Next up: I have no idea.


Published by The Chief
