Why Use This Best Practice
We strongly encourage following this process, among the many possible ones, because we want all of us to work better, work less, and work smarter. This is an open discussion, but the goals are shared:
- Resilient execution: we want the code to need as little maintenance as possible.
- Faster maintenance: we all work smarter when we rely on standard solutions instead of having to decode creative one-offs every time. Want to be creative? Help us find a better standard, so we all grow smarter together.
- Regulatory compliance: web scraping is a serious matter, and we need to know exactly which tools are used. Stay on track; it's mutually beneficial.
Keep an eye on this process, as it continuously evolves!
1. Preliminary Study
1.1. Technology Stack
Evaluate the technology stack of the target website using the Wappalyzer Chrome extension, paying particular attention to the "Security" block.
When a technology is detected under the "Security" section, please check whether this list of technologies includes a specific solution for it.
1.2. API search
Does the website expose internal or public APIs for fetching the price/product data? If so, this is the best available scenario, and we should use them to gather the data.
1.3. JSON in HTML Search
Websites sometimes embed JSON in their HTML, even when they do not expose an API. Using this JSON when it exists makes the scraper more stable.
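As an illustration, here is a minimal sketch of pulling embedded JSON out of a page with the Python standard library only. It assumes the data sits in a `<script type="application/ld+json">` tag, which is one common location but not the only one. The sample HTML, the class name `JsonLdExtractor` and the function `extract_json_ld` are made up for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
	"""Collects the parsed contents of <script type="application/ld+json"> tags."""

	def __init__(self):
		super().__init__()
		self._in_json_ld = False
		self._buffer = []
		self.blocks = []

	def handle_starttag(self, tag, attrs):
		if tag == "script" and dict(attrs).get("type") == "application/ld+json":
			self._in_json_ld = True
			self._buffer = []

	def handle_data(self, data):
		# Script content may arrive in several chunks, so buffer it.
		if self._in_json_ld:
			self._buffer.append(data)

	def handle_endtag(self, tag):
		if tag == "script" and self._in_json_ld:
			self._in_json_ld = False
			try:
				self.blocks.append(json.loads("".join(self._buffer)))
			except json.JSONDecodeError:
				pass  # skip malformed blocks instead of failing the whole page


def extract_json_ld(html):
	parser = JsonLdExtractor()
	parser.feed(html)
	parser.close()
	return parser.blocks


# Hypothetical page content, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example Shoe", "offers": {"price": "59,90", "priceCurrency": "EUR"}}
</script>
</head><body><h1>Example Shoe</h1></body></html>
"""

products = extract_json_ld(SAMPLE_HTML)
```

Reading fields from the parsed dictionaries (for example `products[0]["offers"]["price"]`) does not depend on the page layout, which is where the stability comes from.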
1.4. Pagination
How does the website handle pagination of the product catalogue? Internal services that return the HTML of the catalogue are preferred over loading the full page.
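A possible way to walk such an internal service is sketched below. It assumes the service takes a page number and returns an empty fragment once the catalogue is exhausted; real sites may signal the end differently. `fetch_page` is a hypothetical callable standing in for the actual HTTP request, and `fake_fetch` replaces it so that the sketch runs without network access.

```python
from typing import Callable, Iterator


def iter_catalogue_pages(fetch_page: Callable[[int], str], max_pages: int = 1000) -> Iterator[str]:
	"""Yield HTML fragments page by page until the service returns an empty one."""
	for page in range(1, max_pages + 1):
		fragment = fetch_page(page)
		if not fragment.strip():
			break  # assumption: an empty fragment marks the end of the catalogue
		yield fragment


# Stand-in for the real service: page number -> HTML fragment.
FAKE_SERVICE = {
	1: "<li>Product A</li><li>Product B</li>",
	2: "<li>Product C</li>",
}


def fake_fetch(page: int) -> str:
	return FAKE_SERVICE.get(page, "")


fragments = list(iter_catalogue_pages(fake_fetch))
```

The `max_pages` cap is a safety net against services that never return an empty page.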
2. Code Best Practices
2.1. JSON
Use JSON if available (in the HTML of the page or from an API). It is less prone to change than the page markup.
2.2. XPATHS
Use XPath expressions, not CSS selectors, for clearer code.
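A small sketch of XPath-based extraction follows. To stay self-contained it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. Real pages usually need an HTML-tolerant parser with full XPath support, so treat the parser choice as an assumption of this example. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed catalogue fragment.
SAMPLE = """
<div>
	<ul class="products">
		<li class="product"><a href="/p/1">Shoe</a><span class="price">59,90 €</span></li>
		<li class="product"><a href="/p/2">Boot</a><span class="price">89,00 €</span></li>
	</ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# Each expression states the path from the product node to the field.
names = [a.text for a in root.findall(".//li[@class='product']/a")]
links = [a.get("href") for a in root.findall(".//li[@class='product']/a")]
prices = [s.text for s in root.findall(".//li[@class='product']/span[@class='price']")]
```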
2.3. Indent using TABS
Use tabs for indentation instead of spaces: the code weighs less and badly indented structures are easier to detect.
2.4. No formatting rules in Price
Don't add rules for cleaning price or other numeric fields: formats vary across countries and are not standardized. Let's leave this task to the post-scraping phases in the DBs.
2.5. Product List Page wins over Single Product Page
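In practice this means storing the price text exactly as it appears on the page. The sketch below uses invented sample values and a hypothetical `build_item` helper.

```python
# Hypothetical scraped values: same amount, two country-specific formats.
SCRAPED = [
	("Shoe", "1.299,00 €"),
	("Boot", "$1,299.00"),
]


def build_item(name, price_text):
	# No separator, currency or decimal handling here:
	# the text is passed through untouched for later normalization.
	return {"name": name, "price": price_text}


items = [build_item(name, price_text) for name, price_text in SCRAPED]
```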
Load as few pages as you can. Check whether all the fields you need are available on the product list pages, and avoid entering the single product page whenever possible.
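One way to make this check explicit is sketched below. The field set `REQUIRED_FIELDS`, the helper `missing_fields` and the sample items are hypothetical. The idea is to request a single product page only for items whose list-page data has gaps.

```python
REQUIRED_FIELDS = ("name", "price", "url")  # hypothetical field set


def missing_fields(item, required=REQUIRED_FIELDS):
	"""Return the required fields that the list-page item does not provide."""
	return [field for field in required if not item.get(field)]


# Invented items, as they might come out of a product list page.
list_page_items = [
	{"name": "Shoe", "price": "59,90 €", "url": "/p/1"},
	{"name": "Boot", "price": "", "url": "/p/2"},
]

# Only items with gaps would justify a request to the single product page.
needs_product_page = [item["url"] for item in list_page_items if missing_fields(item)]
```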
Document History
- 2022-03-29 Document created