I started FetchFox to make scraping accessible to everyone, and I’ve been surprised by how many people use scrapers. It’s not just coders: people in marketing, sales, investing, and project management know what scrapers are and what to use them for.
FetchFox makes it really easy to scrape without any technical knowledge, but the tool works best if you know a couple key concepts about web scraping. I’ll try to explain those in this post. Don’t worry, there’s no coding required.
Concept #1: Starting at the start
Every scraping task has one or more starting URLs, also called parent URLs. These tell the scraper where to begin looking for data. The scraper will start by loading these URLs and doing one of two things: pulling data directly from the starting URL, or finding more pages to scrape from the starting points.
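If you like seeing things in code (remember, you don’t need to), here is a tiny Python sketch of the idea. It’s purely illustrative: FetchFox does this step for you, and the requests library and example URL are just assumptions for the sketch.

```python
# Purely illustrative: a scrape begins with one or more starting URLs,
# and the scraper's first job is simply to load them.
import requests  # assumed third-party HTTP library

starting_urls = ["https://weworkremotely.com/"]

for url in starting_urls:
    page = requests.get(url, timeout=30)
    print(url, "->", page.status_code, f"{len(page.text):,} characters of HTML")
```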
Concept #2: Crawling for more
Let’s say your starting point is a jobs board like WeWorkRemotely, and you want to scrape all the jobs on there. Your starting URL would be “https://weworkremotely.com/”. If you visit this page, you’ll notice that most of the key information is not on the starting page. It’s on the individual job listing pages.
For this kind of scrape, we’ll need to do something called “crawling.” Crawling takes your starting URL, and it finds the pages that actually have the data you’re looking for. In this example, it would be the job listing pages like “https://weworkremotely.com/remote-jobs/tiller-social-media-manager”. This page has the job description, location, salary range, company website, and more. This is the data we want to get out.
The job of the crawl step is to find all the URLs with the relevant data. In our example, that would be all the links to job listings. The crawl stage is also responsible for handling pagination: if there are multiple pages of results, we want to make sure we go through all of them.
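You don’t need any code to do this in FetchFox, but for the curious, here is a rough Python sketch of what a crawl step does. It assumes the requests and beautifulsoup4 libraries, uses the “/remote-jobs/” URL pattern from the example above to spot listing links, and skips pagination to stay short.

```python
# A rough sketch of the crawl step: load the starting URL and collect every
# link that looks like a job listing. Assumes the third-party "requests" and
# "beautifulsoup4" libraries; pagination is left out to keep it short.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "https://weworkremotely.com/"

def crawl_for_listings(start_url):
    html = requests.get(start_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    listing_urls = set()
    for link in soup.find_all("a", href=True):
        # Job listing pages on this site look like /remote-jobs/<slug>
        if "/remote-jobs/" in link["href"]:
            listing_urls.add(urljoin(start_url, link["href"]))
    return sorted(listing_urls)

if __name__ == "__main__":
    for url in crawl_for_listings(START_URL):
        print(url)
```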
Once the crawl stage is done, the URLs it finds are passed on to extraction.
Concept #3: Extracting our data
The extraction step takes a web page and gives you structured data. For each URL found in the crawl stage, it produces one row in your results (e.g. one CSV row, or one spreadsheet row).
To continue the example above, suppose we are looking for location, salary range, company name and website. For the URL “https://weworkremotely.com/remote-jobs/tiller-social-media-manager”, the extraction step would give this data:
Company name: Tiller
Company website: https://www.tillerhq.com/
Location: Anywhere in the world
Salary range: $50,000 - $74,999 USD
Above is one result from the extraction step. The extraction step produces a row like this for every URL found in the crawl step.
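Again, FetchFox handles this without code, but here is a simplified Python sketch of the extraction step’s shape: one URL in, one row out, and all rows written to a CSV. The CSS selectors are hypothetical placeholders, not the site’s real markup; in practice FetchFox finds the fields with AI.

```python
# A simplified sketch of the extraction step: each crawled URL becomes one
# structured row, and the rows are written to a CSV. The CSS selectors are
# hypothetical placeholders, not WeWorkRemotely's real markup.
import csv
import requests
from bs4 import BeautifulSoup

FIELDS = ["url", "company_name", "company_website", "location", "salary_range"]

def extract_listing(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    website_link = soup.select_one(".company-card a[href]")  # placeholder selector
    return {
        "url": url,
        "company_name": text_of(".company-card h2"),          # placeholder selector
        "company_website": website_link["href"] if website_link else "",
        "location": text_of(".listing-header .region"),       # placeholder selector
        "salary_range": text_of(".listing-header .salary"),   # placeholder selector
    }

def extract_all(listing_urls, out_path="jobs.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for url in listing_urls:
            writer.writerow(extract_listing(url))  # one row per crawled URL
```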
Concept #4: Deep crawling
The example we described above is a “1 step crawl.” We started at the homepage, and found links from there. But what if we want to follow more links? For example, what if we want to find all the job listings, then go to their company websites, and then find their blogs? This would be a 3 step crawl: homepage → job listing → company site → blog.
Generally, doing more than 1 step in your crawl stage is known as deep crawling. This extends the time it takes to do the scrape, but it can be a powerful way to get rich data.
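For the curious, here is a rough Python sketch of that 3 step crawl. The follow_links helper and the keyword rules (“/remote-jobs/”, “blog”, and so on) are crude assumptions just to show the chaining; they are not how FetchFox actually decides which links to follow.

```python
# A rough sketch of a 3 step deep crawl: homepage -> job listings ->
# company sites -> blogs. The keyword rules below are crude assumptions
# made only to show the chaining of crawl steps.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def follow_links(urls, keyword):
    """Fetch each URL and return the links whose href contains the keyword."""
    found = set()
    for url in urls:
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            if keyword in link["href"]:
                found.add(urljoin(url, link["href"]))
    return sorted(found)

listings = follow_links(["https://weworkremotely.com/"], "/remote-jobs/")  # step 1
companies = follow_links(listings, "http")   # step 2: external company links (rough guess)
blogs     = follow_links(companies, "blog")  # step 3: anything that looks like a blog
```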
Concept #5: Filtering your results
Filtering is a way to narrow down the results the scraper finds. Typically you would filter on information you got in the extraction step, and remove some items from your results set. This can be used to produce cleaner data, or to guide a deep crawl.
For example, in our deep crawl, we could add a filter step for the salary range. We can instruct the scraper to find all the job listings, and then filter down to just the ones that pay over $100,000. From the results of that filter, the scraper continues the deep crawl and fetches their blogs.
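Here is a small Python sketch of a filter like that. It assumes salary strings formatted like the example above (“$50,000 - $74,999 USD”) and keeps only rows whose top figure reaches $100,000.

```python
# A small sketch of a filter step: keep only rows whose salary range reaches
# $100,000. Assumes salary strings like "$50,000 - $74,999 USD".
import re

def max_salary(salary_range):
    """Return the largest dollar figure in the string, or 0 if none is found."""
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$([\d,]+)", salary_range)]
    return max(amounts) if amounts else 0

def filter_high_paying(rows, minimum=100_000):
    return [row for row in rows if max_salary(row["salary_range"]) >= minimum]

# Example rows, as produced by the extraction step
rows = [
    {"company_name": "Tiller",    "salary_range": "$50,000 - $74,999 USD"},
    {"company_name": "ExampleCo", "salary_range": "$100,000 - $125,000 USD"},  # made-up row
]
print(filter_high_paying(rows))  # only the made-up ExampleCo row passes
```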
Scraping for all
I hope learning these concepts helps you build better, more effective scrapes. New AI tools like FetchFox make scraping accessible to anyone, not just coders. With the concepts from this post, you’ll be able to pull the data you need from just about any website.