Using Python and Google Colab to Scrape Content

When I first tried to learn how to program, I took a Python class, and I had not returned to the language until now. Working on the web, Python was not something I came across much. The apps and frameworks I used did not work with Python at all.

But, as with a lot of projects, sometimes you have to use the best tool for the job. I needed to scrape some content and do some analysis on it. I have been obsessed with serverless functions lately, so I initially started with a Node package. It felt a bit too heavy for the job.

Then I came across a post on Google Colab. Colab is a Python development environment that runs in the browser using Google Cloud. For the developers out there, it felt like a CodePen for Python, and I was sold. I wanted to figure out a way to use it. Colab gives you access to a machine with Python on it and lets you run things on demand. You can upgrade to get more power, but for what we are going to do today the basic setup should be fine. All you need to get started is a Google account.

To get started, let's head over to Colab and create an empty notebook. Another wonderful thing about Colab is that the virtual machines have a lot of packages already installed, so we can get away with just importing them. I want to scrape content, so I am going to use these.

import requests
import pandas as pd
from bs4 import BeautifulSoup

Beautiful Soup is a great library for scraping content, and pandas is a data analysis tool that you might want to play with later once we have data.

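# Put the URL of the page you want to scrape between the quotes
pageurl = '<>'
url = pageurl
# Fetch the page, then parse the returned HTML with the lxml parser
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'lxml')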

The code above uses requests to fetch the page at that URL, then tells Beautiful Soup to parse the returned HTML with lxml, one of the parsers it supports.
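
If you want to try the parsing step without hitting a live site, here is a minimal sketch. The `fetch_soup` helper name, the 10 second timeout, and the sample HTML are my own assumptions for illustration, and it uses Python's built-in `html.parser` instead of `lxml` so it does not depend on an extra package.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    Hypothetical helper: the name and the timeout value are assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Parsing works the same on any HTML string, so no network call is needed here.
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(sample_soup.p.text)
```

Calling `fetch_soup` with a real URL should give you the same kind of object as `soup` above, with the added benefit that a failed request stops the notebook cell instead of quietly parsing an error page.
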
Now we can ask the soup for things on the page. Let's get some page information.

## Get Page Title
title = soup.find('title')

## Get Description
meta_tag = soup.find('meta', attrs={'name': 'description'})
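
Both calls return a tag object (or `None` when nothing matches), not the text itself. Here is a rough sketch of pulling the actual values out. The sample HTML and the `title_text` and `description` variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page.
sample_html = """
<html><head>
<title> Demo page </title>
<meta name="description" content="A short demo description.">
</head><body></body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

title = soup.find("title")
meta_tag = soup.find("meta", attrs={"name": "description"})

# find() returns None when nothing matches, so guard before reading values.
title_text = title.get_text(strip=True) if title else None
description = meta_tag.get("content") if meta_tag else None

print(title_text)
print(description)
```

The `None` checks matter on real sites, since plenty of pages ship without a meta description.
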

We can also use this to get the heading tags on a page.

# Print each heading's tag name followed by its text
for heading in soup.find_all(["h1"]):
    print(heading.name + ' ' + heading.text.strip())
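
Since pandas is already imported, one possible next step is to collect the headings into a DataFrame so you can sort, filter, or export them later. This is only a sketch: the sample HTML, the extra `h2` and `h3` tags, and the `tag` and `text` column names are my own choices, not something the page or the libraries require.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page.
sample_html = """
<html><body>
<h1>Main title</h1>
<h2>First section</h2>
<h2> Second section </h2>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Build one row per heading, in the order the headings appear on the page.
rows = [
    {"tag": heading.name, "text": heading.get_text(strip=True)}
    for heading in soup.find_all(["h1", "h2", "h3"])
]
headings_df = pd.DataFrame(rows, columns=["tag", "text"])
print(headings_df)
```

From here something like `headings_df.to_csv("headings.csv", index=False)` would save the results to a file in the Colab session.
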

This type of data might seem super simple to get with a bunch of other tools, and that is true. I am sure 45 browser extensions could get you similar data. But look at this as the start of something, and think about the combinations you can make to extract the data that matters for you and your projects.

Google Colab has one of the lowest barriers to entry for coding and learning Python that I have seen. It takes a lot of complexity out of your way, and it is free to use.

John Siwicki
