Scraping
HTML based websites
Use curl or wget to download a page and scrape it. This works because the server sends the complete content as HTML.
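The same download-and-scrape step can be sketched in Python with only the standard library. This is a minimal illustration, not a complete scraper: the names `LinkCollector`, `extract_links` and `download`, as well as the sample HTML and its paths, are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def download(url, timeout=10):
    """Fetch a page the way curl or wget would: one plain HTTP request."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Hypothetical sample page, so the example runs without network access.
    sample = (
        '<p><a href="/firmware/v1.2.bin">Firmware 1.2</a> '
        '<a href="/manual.pdf">Manual</a></p>'
    )
    print(extract_links(sample))  # ['/firmware/v1.2.bin', '/manual.pdf']
```

For a real page, `extract_links(download(url))` combines both steps.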
Frontend frameworks
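Some websites are built with modern frontend frameworks such as AngularJS, React or Vue.js. Their contents cannot be downloaded with curl or wget: those tools only fetch the initial HTML, while the actual content is generated by JavaScript running in the browser. So how do we get the contents of those websites so they can be processed automatically?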
Our goal is a lightweight solution
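Before reaching for heavier tools it can help to check whether a downloaded page contains its content at all. The following is a rough heuristic sketch using only the Python standard library: a page with script tags but hardly any visible text is probably rendered in the browser. The names `TextAndScriptCounter` and `looks_client_rendered` and the threshold of 200 characters are assumptions chosen for illustration, and the heuristic can misjudge pages in both directions.

```python
from html.parser import HTMLParser


class TextAndScriptCounter(HTMLParser):
    """Counts <script> tags and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text a reader would see, not script or style source.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_client_rendered(html, min_text_chars=200):
    """Guess whether a page needs JavaScript to show its content.

    min_text_chars is an arbitrary threshold picked for this sketch.
    """
    counter = TextAndScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


if __name__ == "__main__":
    # Hypothetical single-page-app shell: an empty mount point plus a bundle.
    shell = (
        "<html><head><title>App</title></head><body>"
        '<div id="root"></div><script src="/bundle.js"></script>'
        "</body></html>"
    )
    print(looks_client_rendered(shell))
```

If the check returns True for a page fetched with curl or wget, the data most likely has to be obtained through a real browser engine or from another source.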
Conclusion
Scraping a webpage based on a frontend framework takes a lot more work than scraping an HTML based webpage, both in learning the tools and in building the code
Our solution:
- Avoid those websites if you think you may need to scrape data from them at some point
- Do not buy products from vendors with such websites if you need information from them later, for example whether a new firmware update is available
- Find another source for the information, one with an easily scrapeable website
- We will never use those frameworks ourselves if we can avoid them, since we do not want to burden people who want information from our website with all the extra work and hassle
Results of our investigations
- CasperJS: navigation scripting and testing for PhantomJS and SlimerJS
- CommonJS seems deprecated since November 2014
- jsdom needs Node.js. Made for scraping: “the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.”
- Node.js® is a JavaScript runtime built on Chrome's V8 JavaScript engine. Node.js is an event-based server-side JavaScript engine
- PhantomJS development is suspended until further notice. PhantomJS version 2.1.1 is the last known stable release. PhantomJS runs on top of WebKit
- ReactJS is a JavaScript library for building user interfaces
- Request is a “Simplified HTTP request client.”
- RequireJS is a module loader
- SlimerJS runs on top of Gecko
Not lightweight
Java based
HtmlUnit is a GUI-less browser for Java programs
HtmlUnit Driver: currently the fastest and most lightweight implementation of WebDriver. As the name suggests, it is based on HtmlUnit
Google Chrome based
Headless Chrome Crawler installs Chromium and Puppeteer
Selenium is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. To drive Chrome it needs Google Chrome (privacy issues ahead) and ChromeDriver
Mozilla Firefox based
Using Firefox in Headless Mode
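One way to use headless Firefox for scraping is to drive it through Selenium's Python bindings and read the rendered DOM. The sketch below assumes the third-party `selenium` package (version 4), Firefox and geckodriver are installed; the function names `make_headless_firefox` and `fetch_rendered_html` are made up for this example. Content that loads asynchronously may need an explicit wait before the page source is read, which this sketch leaves out.

```python
def make_headless_firefox():
    """Start Firefox without a window. Requires the selenium package."""
    # Imported lazily so the rest of the module works without selenium.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def fetch_rendered_html(url, make_driver=make_headless_firefox):
    """Load url in a browser and return the DOM after JavaScript has run.

    make_driver is any callable returning an object with get(), page_source
    and quit(), which makes it possible to swap in another browser or a stub.
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

The returned HTML can then be processed with the same parsing code that is used for pages downloaded with curl or wget. Starting a full browser for every page is what makes this approach not lightweight.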
Copyright © : 2014 - 2024 Webevaluation.nl and the authors
Changes reserved.