SQL Server Data Engineering: Python

Sunday, 27 September 2020

Getting started with Python Web Scraping

Getting started with Python on a Mac was fairly straightforward, but I hit a few stumbling blocks on Windows. The easiest way I found to get Python, a decent IDE and terminal, and the additional libraries on Windows was to install Anaconda and use the Spyder IDE.
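Before running anything, it can help to confirm that the two libraries the scraping script imports are actually visible to the interpreter Spyder is using. The snippet below is only a quick sketch of that check; the helper name missing_packages is my own, not part of any library.

```python
import importlib.util


def missing_packages(names):
    """Return the names from the list that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'requests' and 'bs4' are the two imports used by the scraping script
missing = missing_packages(['requests', 'bs4'])
print('Missing packages:', missing or 'none')
```

If anything is reported as missing, installing it from the Anaconda Prompt (for example with conda install) is usually the simplest fix.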

Using Beautiful Soup for web scraping, I wrote the following script to get the current top self-help books from the Amazon UK Best Sellers list:

# Script to get the top self-help books from the Amazon UK Best Sellers list

import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.co.uk/Best-Sellers-Books-Self-Help-How/zgbs/books/2996349031/ref=zg_bs_nav_b_3_2996114031'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# The best sellers are held in a single ordered list on the page
results = soup.find('ol', class_='a-ordered-list a-vertical')

# Each list item is one book; look at the first 50
list_elems = results.find_all('li', class_='zg-item-immersion')
for list_elem in list_elems[:50]:
    rank_elem = list_elem.find('span', class_='zg-badge-text')
    title_elem = list_elem.find('div', class_='p13n-sc-truncate p13n-sc-line-clamp-1')
    author_elem = list_elem.find('span', class_='a-size-small a-color-base')

    # Skip any item where one of the three elements was not found
    if rank_elem is None or title_elem is None or author_elem is None:
        continue

    # Trim the padding and newlines around the title, keeping the spaces between words
    title = title_elem.text.strip()

    print(rank_elem.text.replace('#', '') + ' - ' + title + ' - ' + author_elem.text.replace('\t', ''))
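The title text comes back from the page padded with spaces and newlines, which is why the script has to clean it before printing. As a small standalone sketch of one way to do that clean-up, the helper below (clean_text is a name I made up, and the sample title is a placeholder, not real scraped data) collapses every run of whitespace into a single space:

```python
def clean_text(raw):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # str.split() with no arguments splits on any whitespace, including \n and \t
    return ' '.join(raw.split())


# Placeholder text standing in for what title_elem.text might look like
raw_title = '\n            Some Book Title:   A Subtitle\n        '
print(clean_text(raw_title))  # Some Book Title: A Subtitle
```

The same helper could be applied to the author text as well, instead of removing tab characters on their own.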

The scraping script prints one line per book, in the form: rank - title - author.
