"Writing a web scraper with most libraries means writing code to request, parse, select, (perhaps) modify, and serialise HTML content. You might request a single URL, or fetch a collection of URLs populated either from the results of another task or from known patterns in addresses. You're likely selecting content with XPath expressions, each scoped to a single DOM element that represents one item. You're probably serialising to a list of objects with a fixed structure, reflecting how you expect the data was originally represented.

That's a lot of boilerplate! Most of what varies from project to project is the structure of the target pages and the data underlying them. Some friends and I have been working on a library to abstract the rest away, so you don't need to describe anything but the features unique to your project. We're motivated by the ideal of a function from a minimal representation of site structure, expressed as data, to the data that structure is intended to specify.

The talk will touch on: patterns in web scraping, code as data, ad-hoc polymorphism constrained to objects sharing algebraic structure, &c."
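To make the idea of "a function from site structure, described as data, to the data itself" concrete, here is a minimal sketch. It is not the speaker's library or its API: the names `spec`, `extract`, and `selectAll` are hypothetical, the "page" is a hand-built tree of plain objects standing in for a parsed DOM, and matching by class name stands in for the XPath selection mentioned in the abstract.

```javascript
// Hypothetical stand-in for a parsed page: nested plain objects, not a real DOM.
const page = {
  tag: "ul", cls: "results", children: [
    { tag: "li", cls: "item", children: [
      { tag: "a", cls: "title", text: "First post", attrs: { href: "/posts/1" }, children: [] },
      { tag: "span", cls: "author", text: "alice", children: [] },
    ] },
    { tag: "li", cls: "item", children: [
      { tag: "a", cls: "title", text: "Second post", attrs: { href: "/posts/2" }, children: [] },
      // No author element here, so the field should come back as null.
    ] },
  ],
};

// Depth-first search for every descendant whose class matches.
// (Assumption: class matching replaces XPath to keep the sketch dependency-free.)
function selectAll(node, cls) {
  const out = [];
  for (const child of node.children || []) {
    if (child.cls === cls) out.push(child);
    out.push(...selectAll(child, cls));
  }
  return out;
}

// The only project-specific part: site structure written down as data.
const spec = {
  item: "item",
  fields: {
    title: { select: "title" },
    url: { select: "title", attr: "href" },
    author: { select: "author" },
  },
};

// Generic function from (spec, page) to a list of fixed-structure records.
function extract(spec, root) {
  return selectAll(root, spec.item).map((item) => {
    const record = {};
    for (const [name, field] of Object.entries(spec.fields)) {
      const match = selectAll(item, field.select)[0];
      if (match === undefined) {
        record[name] = null;
      } else if (field.attr) {
        const value = (match.attrs || {})[field.attr];
        record[name] = value === undefined ? null : value;
      } else {
        record[name] = match.text === undefined ? null : match.text;
      }
    }
    return record;
  });
}

const records = extract(spec, page);
console.log(JSON.stringify(records, null, 2));
```

In this sketch, pointing the scraper at a different site means writing a new `spec` object; `extract` itself does not change.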

Vi helps Singapore-based web-data startup Krake. He likes math, functional programming, and security.


JSConf.Asia is the JavaScript, web and mobile developer conference for Asia. Hotel H2O and Manila Ocean Park, Philippines - 28 + 29 November 2013.
