

Scraping

HTML based websites

Use curl or wget to download a page and scrape it
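
As a rough illustration of the download-and-scrape step, here is a minimal sketch in Python using only the standard library. The names `LinkCollector`, `extract_links` and `fetch` are hypothetical helpers made up for this example, and collecting link targets is just one possible scraping task; `fetch` plays the role that curl or wget would play on the command line.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Hypothetical helper: download a page, roughly what `curl URL` does."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Offline sample, so the sketch runs without network access.
    sample = '<p><a href="/firmware/v2.bin">Firmware 2</a> <a href="/news">News</a></p>'
    print(extract_links(sample))
```

For a real page, `extract_links(fetch(url))` would combine both steps. This only works when the information is present in the HTML that the server sends.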

Frontend frameworks

Some websites are built with modern frontend frameworks such as AngularJS, React or Vue.js. Their content is typically rendered by JavaScript in the browser, so curl or wget only retrieve the initial page without that content. So how do you get the content of such a website so that it can be processed automatically?
Our goal is a lightweight solution.
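
To make the problem concrete, the sketch below guesses whether a downloaded page is only an empty shell that JavaScript fills in later. It is a heuristic under stated assumptions, not a reliable detector: the names `TextCounter` and `looks_client_rendered` and the threshold of 50 characters are made up for this example.

```python
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Count visible text characters and <script> tags in a document."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore JavaScript and CSS source, count everything else.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def looks_client_rendered(html, min_text_chars=50):
    """Guess: scripts are present but there is hardly any visible text."""
    counter = TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


if __name__ == "__main__":
    shell = ('<html><head><title>App</title></head><body>'
             '<div id="root"></div><script src="/bundle.js"></script>'
             '</body></html>')
    print(looks_client_rendered(shell))
```

A page that passes this check most likely needs one of the JavaScript-capable tools listed further down; a page that fails it can usually be scraped as plain HTML.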

Conclusion

Scraping a webpage built with a frontend framework takes a lot more work, both in learning and in writing code, than scraping an HTML based webpage.
Our solution:

  • Avoid those websites if you think you will need to scrape data from them at any time
  • Do not buy products from those websites if you need information from them later, for example when a new firmware update is available
  • Find another source for the information which has an easily scrapable website
  • We will never use those frameworks if we can avoid them, since we do not want to burden people who want some information from our website with all the extra work and hassle

Results of our investigations

  • CasperJS is a navigation scripting & testing utility for PhantomJS and SlimerJS
  • Cheerio is a “Fast, flexible & lean implementation of core jQuery”. It needs Node.js and npm if you need to execute JavaScript. An implementation example can be found in this Cheerio tutorial and here and here
  • CommonJS seems deprecated since November 2014
  • jsdom needs Node.js. Made for scraping: “the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.”
  • Node.js® is a JavaScript runtime built on Chrome's V8 JavaScript engine. Node.js is an event-based server-side JavaScript engine
  • PhantomJS development is suspended until further notice. PhantomJS version 2.1.1 is the last known stable release. PhantomJS runs on top of WebKit
  • ReactJS is a JavaScript library for building user interfaces
  • Request is a “Simplified HTTP request client.”
  • RequireJS is a module loader
  • SlimerJS runs on top of Gecko

Not lightweight

Java based

HtmlUnit is a GUI-Less browser for Java programs

Google Chrome based

Headless Chrome Crawler installs Chromium and Puppeteer
HtmlUnit Driver This is currently the fastest and most lightweight implementation of WebDriver. As the name suggests, this is based on HtmlUnit
Selenium is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. To drive Chrome it needs Google Chrome (privacy issues ahead) and ChromeDriver
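
For comparison, a rendered page can also be obtained without any of the tools above, using the `--headless` and `--dump-dom` command-line flags of Chrome/Chromium itself. The sketch below is only an illustration of that alternative: the binary names in `CANDIDATE_BINARIES` are assumptions that differ per distribution, and `build_command` and `render` are hypothetical helper names.

```python
import shutil
import subprocess

# Assumed binary names; adjust to whatever your system installs.
CANDIDATE_BINARIES = ("chromium", "chromium-browser", "google-chrome")


def build_command(binary, url):
    """Command line that prints the DOM after JavaScript has run."""
    return [binary, "--headless", "--dump-dom", url]


def render(url, timeout=60):
    """Return the rendered HTML of url, using the first browser found on PATH."""
    binary = next((b for b in CANDIDATE_BINARIES if shutil.which(b)), None)
    if binary is None:
        raise RuntimeError("no Chromium or Chrome binary found on PATH")
    result = subprocess.run(
        build_command(binary, url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only shows the command; call render(url) to actually start the browser.
    print(build_command("chromium", "https://example.org/"))
```

The returned HTML can then be processed like any HTML based page. This still requires a full browser installation, so it belongs in the "not lightweight" category.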

Mozilla Firefox based

Using Firefox in Headless Mode


Copyright © : 2014 - 2022 Webevaluation.nl and the authors
Changes reserved.

websites_scraping.txt · Last modified: 24-05-2020 13:45 by wim