Using Python and Scrapy to Scrape Beer Data
- Marcos Jonatan Suriani
- Dec 3, 2022
- 2 min read

I'm working on a new project: building a database of beer data that can later be used to enrich ratings data and to power specific user experiences. A few sites publish this information, and Scrapy is one of the easiest ways to build a web crawler/scraper to gather it and process it into this new database.
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
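If you haven't used Scrapy before, a project is usually bootstrapped from the command line. The project name below is just a placeholder and not necessarily how this repository is laid out:

pip install scrapy
scrapy startproject beer_crawlers
cd beer_crawlers
scrapy genspider beeradvocate beeradvocate.com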
The source-code for this project can be found at github: jonatansuriani/beer-crawlers.
The first scraper I've created focuses on getting data from BeerAdvocate.com. The strategy is to first get data from the style listing, then for each style get the beer details, and for each beer get the brewery details.
The spider is called BeerAdvocateSpider and uses the style listing page as its starting URL:
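In outline, the spider looks roughly like this; the styles listing URL is my assumption for the start URL, so check the repository for the exact value:

import scrapy

class BeerAdvocateSpider(scrapy.Spider):
    name = 'beeradvocate'
    # Assumed styles listing page; the real spider may configure this differently
    start_urls = ['https://www.beeradvocate.com/beer/styles/']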

The strategy is to get data from each style listed, as follows:
def parse(self, response):
    style_pages = '#ba-content li a'
    yield from response.follow_all(css=style_pages, callback=self.parse_style)

The response.follow_all call follows the link for each style and hands the response to parse_style. For instance, this is the Bock style page:

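Before writing the parser, you can poke at a page like this interactively with Scrapy's shell; the URL pattern below is only illustrative and the style id is a placeholder:

scrapy shell 'https://www.beeradvocate.com/beer/styles/<style-id>/'
>>> response.css('h1::text').get()                                   # the style name
>>> response.xpath('//div[@id="ba-content"]/div[1]/text()').get()    # the description block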
And this is how its data is read:
def parse_style(self, response):
    def from_content(query):
        # Helper: run an XPath query relative to the main content block
        content_xpath = '//div[@id="ba-content"]/div[1]'
        return response.xpath(content_xpath).xpath(query).get(default='').strip()

    # Follow every beer profile link listed on the style page
    beers_page = response.xpath("//tr//td//a[contains(@href, '/beer/profile/')][1]")
    yield from response.follow_all(beers_page, callback=self.parse_beer)

    yield {
        'type': 'style',
        'original_url': response.url,
        'doc': {
            'name': response.css('h1::text').get(),
            'description': from_content('text()'),
            'abv': from_content('span[contains(.,"ABV:")]/text()'),
            'ibu': from_content('span[contains(.,"IBU:")]/text()')
        }
    }
Then we get details for each beer listed:
def parse_beer(self, response):
    # Resolve the brewery profile link from the "From:" field and crawl it as well
    brewery_url = response.urljoin(response.xpath("//dt[contains(.,'From:')]/following-sibling::dd[1]/a/@href").get())
    yield response.follow(brewery_url, callback=self.parse_brewery)

    yield {
        'type': 'beer',
        'original_url': response.url,
        'doc': {
            'name': response.css('h1::text').get(),
            'images': response.xpath('//div[@id="main_pic_norm"]/div/img/@src').getall(),
            'brewery': {
                'original_url': brewery_url,
                'name': response.xpath("//dt[contains(.,'From:')]/following-sibling::dd[1]/a/b/text()").get()
            }
        }
    }

Then, for each brewery, its details are scraped using:
def parse_brewery(self, response):
    yield {
        'type': 'brewery',
        'original_url': response.url,
        'doc': {
            'name': response.css('h1::text').get(),
            'images': response.xpath('//div[@id="main_pic_norm"]/img/@src').getall(),
            # The info box has no labelled fields, so the address parts are picked
            # by position; these indexes are tied to the current page layout
            'address': {
                'address': response.xpath('//div[@id="info_box"]/text()').getall()[2],
                'zipcode': response.xpath('//div[@id="info_box"]/text()').getall()[4],
                'city': response.xpath('//div[@id="info_box"]/a/text()').getall()[0],
                'state': response.xpath('//div[@id="info_box"]/a/text()').getall()[1],
                'country': response.xpath('//div[@id="info_box"]/a/text()').getall()[2],
                'map': response.xpath('//div[@id="info_box"]/a/@href').getall()[3],
                'website': response.xpath('//div[@id="info_box"]/a/@href').getall()[4]
            }
        }
    }
To run the project, follow the setup instructions on GitHub, then run it locally by specifying the beeradvocate crawler and choosing a feed export. To export to a JSON file, run:
scrapy crawl beeradvocate -O data.json

This is a sample of the data generated: sample-data.json
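Each item in the exported file has one of the three shapes produced by the callbacks above, roughly like this (placeholder values, not real scraped data):

{"type": "style",   "original_url": "...", "doc": {"name": "...", "description": "...", "abv": "...", "ibu": "..."}}
{"type": "beer",    "original_url": "...", "doc": {"name": "...", "images": ["..."], "brewery": {"original_url": "...", "name": "..."}}}
{"type": "brewery", "original_url": "...", "doc": {"name": "...", "images": ["..."], "address": {"city": "...", "state": "...", "country": "..."}}}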
The next step for this project is to upload the data to S3 using Scrapy's S3 feed storage; from there the data can be read and sent to a Kafka topic for later processing.
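A minimal sketch of what that could look like in settings.py; the bucket name is hypothetical and botocore must be installed for the s3:// scheme to work:

# settings.py -- bucket and path are placeholders
FEEDS = {
    's3://my-beer-bucket/crawls/%(name)s/%(time)s.json': {
        'format': 'json',
    },
}
AWS_ACCESS_KEY_ID = '...'        # or rely on the default botocore credential chain
AWS_SECRET_ACCESS_KEY = '...'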