
Using Python and Scrapy to Scrape Beer Data

  • Writer: Marcos Jonatan Suriani
  • Dec 3, 2022
  • 2 min read


I'm working on a new project to create a database of beer data, so it can later be used to enrich ratings data and to build specific user experiences. A few sites publish this information, and Scrapy is one of the easiest ways to create a web crawler/scraper to gather it and process it into this new database.


Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

The source code for this project can be found on GitHub: jonatansuriani/beer-crawlers.


The first scraper I've created focuses on getting data from BeerAdvocate.com. The strategy is to first get data for each style, then for each style get the beer details, and for each beer get the brewery details.


The spider I created is called BeerAdvocateSpider, and it uses the style listing page as its starting URL.

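A minimal sketch of what that spider declaration could look like (the class and crawler names come from this post; the exact start URL is an assumption and may differ from the repository):

import scrapy

class BeerAdvocateSpider(scrapy.Spider):
	# Name used later with "scrapy crawl beeradvocate"
	name = 'beeradvocate'
	# Assumed style listing page, used as the single starting point
	start_urls = ['https://www.beeradvocate.com/beer/styles/']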

The strategy is to get data from each style listed, as follows:

def parse(self, response):
	# Select every style link inside the main content area
	style_pages = '#ba-content li a'
	# Schedule a request for every style page found
	yield from response.follow_all(css=style_pages, callback=self.parse_style)
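The response.follow_all call builds one request per matched link and hands each response to the parse_style callback. It is roughly equivalent to looping over the links yourself (illustration only):

# Rough equivalent of the follow_all call above
for link in response.css('#ba-content li a'):
	yield response.follow(link, callback=self.parse_style)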

For instance, the Bock style page contains a description, ABV and IBU ranges, and a table listing the beers of that style.


And this is how its data is read:


def parse_style(self, response):
	def from_content(query):
		# Query relative to the style description block at the top of the page
		content_xpath = '//div[@id="ba-content"]/div[1]'
		return response.xpath(content_xpath).xpath(query).get(default='').strip()

	# Select the beer profile links from the style's listing and follow each one
	beers_page = response.xpath("//tr//td//a[contains(@href, '/beer/profile/')][1]")
	yield from response.follow_all(beers_page, callback=self.parse_beer)

	yield {
		'type' : 'style',
		'original_url': response.url,
		'doc': {
			'name': response.css('h1::text').get(),
			'description': from_content('text()'),
			'abv': from_content('span[contains(.,"ABV:")]/text()'),
			'ibu': from_content('span[contains(.,"IBU:")]/text()')
		}
	}

Then we get details for each beer listed:

def parse_beer(self, response):
	# Resolve the brewery link from the "From:" field into an absolute URL
	brewery_url = response.urljoin(response.xpath("//dt[contains(.,'From:')]/following-sibling::dd[1]/a/@href").get())

	# Schedule the brewery page to be handled by parse_brewery
	yield response.follow(brewery_url, callback=self.parse_brewery)

	yield {
		'type' : 'beer',
		'original_url': response.url,
		'doc':{
			'name': response.css('h1::text').get(),
			'images': response.xpath('//div[@id="main_pic_norm"]/div/img').getall(),
			'brewery': {
				'original_url': brewery_url,
				'name': response.xpath("//dt[contains(.,'From:')]/following-sibling::dd[1]/a/b/text()").get()
			}
		}
	}

Then, for each brewery, its details are scraped; since many beers point to the same brewery, Scrapy's default duplicate request filter keeps each brewery page from being fetched more than once:

def parse_brewery(self, response):

	yield {
		'type' : 'brewery',
		'original_url': response.url,
		'doc':{
			'name': response.css('h1::text').get(),
			'images': response.xpath('//div[@id="main_pic_norm"]/img/@src').getall(),
			'address':{
				# Address fields are read positionally from the info box text nodes,
				# so these indices depend on the page layout
				'address':  response.xpath('//div[@id="info_box"]/text()').getall()[2],
				'zipcode':  response.xpath('//div[@id="info_box"]/text()').getall()[4],
				'city': response.xpath('//div[@id="info_box"]/a/text()').getall()[0],
				'state': response.xpath('//div[@id="info_box"]/a/text()').getall()[1],
				'country': response.xpath('//div[@id="info_box"]/a/text()').getall()[2],
				'map': response.xpath('//div[@id="info_box"]/a/@href').getall()[3],
				'website': response.xpath('//div[@id="info_box"]/a/@href').getall()[4]
			}
		}
	}
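All three kinds of item share the same shape (type, original_url, doc), so a single Scrapy item pipeline can tell them apart by the type field. A minimal sketch (not part of the original project; it would need to be enabled via ITEM_PIPELINES):

class RouteByTypePipeline:
	# Illustrative pipeline: the 'type' field on every yielded dict
	# lets one pipeline handle styles, beers, and breweries together.
	def open_spider(self, spider):
		self.counts = {'style': 0, 'beer': 0, 'brewery': 0}

	def process_item(self, item, spider):
		self.counts[item['type']] += 1
		return item

	def close_spider(self, spider):
		spider.logger.info('Scraped %s', self.counts)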

To run the project, follow the setup instructions on GitHub. To run it locally, specify the beeradvocate crawler and choose the feed export. To export to a JSON file, run:

scrapy crawl beeradvocate -O data.json
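The -O flag overwrites the output file on each run (-o would append). For larger crawls, JSON Lines output (one item per line) can also be handy; Scrapy infers the format from the file extension:

scrapy crawl beeradvocate -O data.jl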

This is a sample of the data generated: sample-data.json
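Because every beer item records its brewery's original_url, and every brewery item carries that same URL as its own original_url, the exported items can be joined back together offline. A small sketch (illustration only, assuming the data.json export above and that the brewery pages are not redirected):

import json

# Join the exported beer and brewery items on the brewery URL
with open('data.json') as f:
	items = json.load(f)

breweries = {item['original_url']: item['doc'] for item in items if item['type'] == 'brewery'}

for item in items:
	if item['type'] == 'beer':
		brewery_url = item['doc']['brewery']['original_url']
		item['doc']['brewery'].update(breweries.get(brewery_url, {}))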


The next step for this project is to upload the data to S3 using Scrapy's S3 feed storage; from there, the data can be read and sent to a Kafka topic for later processing.
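A sketch of what that could look like in settings.py, assuming a placeholder bucket and credentials (S3 feed storage also requires the botocore package):

# settings.py (sketch) - export items straight to S3 instead of a local file.
# Bucket, path, and credentials below are placeholders.
AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_KEY'

FEEDS = {
	's3://your-beer-bucket/beeradvocate/%(time)s.json': {
		'format': 'json',
	},
}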
