
Using Python and Scrapy to Scrape Beer Data

  • Marcos Jonatan Suriani
  • Dec 30, 2022
  • 2 min read

I'm working on a new project that builds a database of beer data, so it can later be used to enrich ratings data and to create specific user experiences. A few sites make this information publicly available, and Scrapy is one of the easiest ways to build a web crawler/scraper that gathers it for processing into the new database.


Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

The source code for this project can be found on GitHub.


The first scraper I've created focuses on getting data from BeerAdvocate.com. The strategy is to first get data on the styles, then for each style get the details of its beers, and for each beer get the details of its brewery.


The spider is called BeerAdvocateSpider and uses the style listing page as its start URL:


The strategy is to get data from each style listed, as follows:

def parse(self, response):
	# Follow every style link listed in the main content area
	style_pages = '#ba-content li a'
	yield from response.follow_all(css=style_pages, callback=self.parse_style)

Each request scheduled by response.follow_all fetches the details of one style. For instance, this is the Bock style page:


And this is how its data is read:


def parse_style(self, response):
	def from_content(query):
		# Query relative to the main content block of the style page
		content_xpath = '//div[@id="ba-content"]/div[1]'
		return response.xpath(content_xpath).xpath(query).get(default='').strip()

	# Follow the first beer-profile link in each row of the beer listing
	beers_page = response.xpath("//tr//td//a[contains(@href, '/beer/profile/')][1]")
	yield from response.follow_all(beers_page, callback=self.parse_beer)

	yield {
		'type' : 'style',
		'original_url': response.url,
		'doc': {
			'name': response.css('h1::text').get(),
			'description': from_content('text()'),
			'abv': from_content('span[contains(.,"ABV:")]/text()'),
			'ibu': from_content('span[contains(.,"IBU:")]/text()')
		}
	}
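The abv and ibu fields come back as raw labels such as "ABV: 6.3-7.5%" (the format is an assumption based on how the style pages render); a small post-processing step can turn them into numbers:

```python
import re

def parse_range(label):
    """Extract the low/high numeric values from a label like 'ABV: 6.3-7.5%'.

    The label format is assumed from the BeerAdvocate style pages;
    adjust the pattern if the site changes.
    """
    numbers = re.findall(r'\d+(?:\.\d+)?', label)
    if not numbers:
        return None, None
    return float(numbers[0]), float(numbers[-1])

# Example usage on typical scraped strings
abv_low, abv_high = parse_range('ABV: 6.3-7.5%')   # (6.3, 7.5)
ibu_low, ibu_high = parse_range('IBU: 20-30')      # (20.0, 30.0)
```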

Then we get details for each beer listed:

def parse_beer(self, response):
	# The brewery link sits in the dd element right after the 'From:' label
	brewery_url = response.urljoin(response.xpath("//dt[contains(.,'From:')]/following-sibling::dd[1]/a/@href").get())

	yield response.follow(brewery_url, callback=self.parse_brewery)

	yield {
		'type' : 'beer',
		'original_url': response.url,
		'doc':{
			'name': response.css('h1::text').get(),
			'images': response.xpath('//div[@id="main_pic_norm"]/div/img/@src').getall(),
			'brewery': {
				'original_url': brewery_url,
				'name': response.xpath("//dt[contains(.,'From:')]/following-sibling::dd[1]/a/b/text()").get()
			}
		}
	}

Then, for each brewery, its details are scraped using:

def parse_brewery(self, response):
	# Read the info box once; the address fields sit at fixed positions
	# among its text nodes and links
	info_texts = response.xpath('//div[@id="info_box"]/text()').getall()
	info_links = response.xpath('//div[@id="info_box"]/a/text()').getall()
	info_hrefs = response.xpath('//div[@id="info_box"]/a/@href').getall()

	yield {
		'type' : 'brewery',
		'original_url': response.url,
		'doc':{
			'name': response.css('h1::text').get(),
			'images': response.xpath('//div[@id="main_pic_norm"]/img/@src').getall(),
			'address':{
				'address': info_texts[2].strip() if len(info_texts) > 2 else '',
				'zipcode': info_texts[4].strip() if len(info_texts) > 4 else '',
				'city': info_links[0] if len(info_links) > 0 else '',
				'state': info_links[1] if len(info_links) > 1 else '',
				'country': info_links[2] if len(info_links) > 2 else '',
				'map': info_hrefs[3] if len(info_hrefs) > 3 else '',
				'website': info_hrefs[4] if len(info_hrefs) > 4 else ''
			}
		}
	}

To run the project, follow the setup instructions on GitHub. To run it locally, specify the beeradvocate crawler and choose a feed export. To export as a JSON file, run:

scrapy crawl beeradvocate -O data.json

This is a sample of the data generated: sample-data.json
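Since every record carries a type field, the exported file can be split back into styles, beers, and breweries with a few lines of standard-library Python (a sketch; the file name matches the -O argument above):

```python
import json
from collections import defaultdict

def split_by_type(path):
    """Group records exported by the spider by their 'type' field."""
    with open(path) as fp:
        # A JSON feed export (-O data.json) writes a single JSON array
        records = json.load(fp)
    groups = defaultdict(list)
    for record in records:
        groups[record['type']].append(record)
    return groups
```

After running the crawl, groups['beer'] holds every beer item, groups['style'] every style, and groups['brewery'] every brewery.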


The next step for this project is uploading the data to S3 using S3 feed storage; from there the data can be read and sent to a Kafka topic for later processing.
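Scrapy's built-in S3 feed storage only needs a FEEDS setting pointing at an s3:// URI plus AWS credentials. A minimal settings.py sketch (bucket name and key path are placeholders; requires the botocore package):

```python
# settings.py (sketch) -- S3 feed storage; bucket and path are placeholders.
AWS_ACCESS_KEY_ID = 'your-access-key'        # or rely on environment/IAM credentials
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

FEEDS = {
    's3://your-bucket/beer-data/%(name)s-%(time)s.json': {
        'format': 'json',
        'encoding': 'utf8',
    },
}
```

With this in place, scrapy crawl beeradvocate uploads the export straight to the bucket instead of writing a local file.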
