Scraper Service Wishlist
At Skimbl, we monitor restaurants’ guest reviews across a whole bunch of websites to provide Quality Control reports to restaurant groups.
Since most of the sites we monitor don’t offer an API, we have to parse their HTML regularly. Whenever a site changes or something breaks, I need to be notified so I can react accordingly. While I am reasonably happy with what I came up with, I still think it could be done much better, as a service.
There are already a bunch of services dealing with scraping, but none of them really matches our needs. Hopefully I am not the only one in this situation.
What I need is a scraping service for geeks, with Quality Control and UI sweetness, the same way Sublime Text is an editor for geeks with an amazing UI.
This service would be used to:
- Create scrapers for different sources of data.
- Run them on a regular basis.
- Make sure they are running properly.
- When they are not running properly:
  - be notified;
  - modify them easily without breaking compatibility with other scrapers.
- Use the data in some application.
In order to do that, such a service would allow its users to:
- Define a common format for all scrapers.
- Specify expectations about what the scraper should return, for example: “field X is a date between now and Jan. 1st, 2010”, or “if there were items the previous time we scraped, there should be items this time (no items indicating a probable error)”. These expectations should be defined on the common format, and it should be possible to make them stricter on each scraper (see the first sketch after this list).
- Implement new scrapers in a visual way. It should be possible to specify nested content, pagination and advanced things such as parsing scripts and fetching custom URLs (e.g. http://example.com/ajax_url/<SOME_ID>).
- Run each scraper individually on demand.
- Schedule scrapers to run “every hour”, “every day” etc.
- Keep versions of each scraper’s implementation and schedule.
- Each time a scraper is run, check the fetched data against the expectations. Send daily notifications about the results.
- When a scraper fails, retry immediately, then after 10 min, then after 1 h, 5 h, 10 h, and so on (sketched below). Ignore the regular schedule while the scraper is failing.
- Keep a log of each scraper’s execution status.
- Define templates and create new scrapers through an API. For example, I want to create an “Ebay Article” template that runs every day, then create an instance of it with a specific article URL (see the last sketch below).
- Each time a scraper is run, the result should be sent via POST to a custom callback URL.
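To make the “expectations” idea a bit more concrete, here is a minimal sketch of how they could be declared and checked after each run. Everything in it is an assumption: the field names, the shape of the scraped items, and the `check_expectations` helper are made up purely for illustration.

```python
from datetime import datetime

# Hypothetical declaration of what a scraper is expected to return:
# "review_date" must fall between Jan. 1st, 2010 and now, and an empty
# result is only acceptable if the previous run was empty too.
EXPECTATIONS = {
    "review_date": {"min": datetime(2010, 1, 1), "max": datetime.now},
    "non_empty_if_previously_non_empty": True,
}

def check_expectations(items, previous_items, expectations=EXPECTATIONS):
    """Return a list of human-readable violations (empty list means all good)."""
    errors = []

    if expectations.get("non_empty_if_previously_non_empty"):
        if previous_items and not items:
            errors.append("previous run had items, this run has none")

    rule = expectations.get("review_date")
    if rule:
        max_value = rule["max"]() if callable(rule["max"]) else rule["max"]
        for item in items:
            value = item.get("review_date")
            if not isinstance(value, datetime):
                errors.append(f"review_date missing or not a date: {item!r}")
            elif not (rule["min"] <= value <= max_value):
                errors.append(f"review_date out of range: {value}")

    return errors
```

A stricter scraper would simply override parts of this shared dictionary, which is what “defined on the common format, stricter on each scraper” means in practice.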
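The failure handling could be as simple as a retry schedule that takes over from the regular one. A rough sketch, with the assumption that the last delay just repeats until the scraper recovers:

```python
import time

# Hypothetical retry delays after a failure, in seconds:
# immediately, then 10 min, 1 h, 5 h, 10 h, then every 10 h.
RETRY_DELAYS = [0, 10 * 60, 60 * 60, 5 * 3600, 10 * 3600]

def run_with_retries(run_scraper, scraper_id):
    """Keep retrying a failed scraper, ignoring its regular schedule."""
    attempt = 0
    while True:
        time.sleep(RETRY_DELAYS[min(attempt, len(RETRY_DELAYS) - 1)])
        try:
            return run_scraper(scraper_id)  # success: back to the normal schedule
        except Exception as error:
            attempt += 1
            print(f"scraper {scraper_id} failed (attempt {attempt}): {error}")
```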
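And here is what the template / API part and the POST callback could look like from a user’s point of view. The endpoints, payloads and URLs are all hypothetical; they just show the level of simplicity I am after:

```python
import requests

API = "https://scraper-service.example.com/api"  # hypothetical service

# Create a reusable "Ebay Article" template that runs every day.
template = requests.post(f"{API}/templates", json={
    "name": "Ebay Article",
    "schedule": "every day",
    "expectations": {"price": {"type": "number", "min": 0}},
}).json()

# Instantiate the template for one specific article URL.
requests.post(f"{API}/scrapers", json={
    "template_id": template["id"],
    "start_url": "http://example.com/ajax_url/<SOME_ID>",
    # After each run, the results would be POSTed here as JSON.
    "callback_url": "https://my-app.example.com/scraper-results",
})
```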
So far, none of GetData.io, Import.io, KimonoLabs, or any other service I have found fits the bill. Did I miss anything?