Get the sourced data you need to kick-start an app project
Even if you tick all these boxes, you still need a domain-specific dataset before you write a single line of code. Modern applications consume large amounts of data, in real time or in batches, to provide value to their users.
In this blog, we explain our workflow for generating such datasets. You will see how we automate data scraping across different websites with no manual intervention.
Our objective is to produce a dataset for a price comparison web app. The product category we use as an example is handbags. For this application, price and product data for handbags need to be collected from various online sellers every day. Although some sellers offer an API for accessing the required details, not all of them do. Therefore, web scraping is unavoidable!
In this example, we create web spiders for 10 sellers using Scrapy and Python. Then we automate the procedure with Apache Airflow, so that the whole procedure runs periodically without any manual involvement.
All the associated source code is available in the GitHub repository.
Before starting any web scraping project, we need to define which sites the project will cover. We decided to include the 10 most visited online stores in Turkey for handbags. You can find the list in our GitHub repository.
You need to install Scrapy on your computer and create a Scrapy project before writing any Scrapy spiders.
We created a folder structure on the local computer to keep the project files neatly in separate folders.
The ‘csvFiles’ folder holds one CSV file for each website to be scraped. The spiders read the ‘starting URLs’ from these CSV files, so we do not need to hard-code them in the spiders.
The ‘fashionWebScraping’ folder holds the Scrapy spiders along with helper scripts including ‘pipelines.py’, ‘settings.py’, and ‘items.py’. We need to modify a few of these helper scripts for the scraping procedure to run successfully.
Extracted product images are saved in the ‘images_scraped’ folder.
During scraping, all the product data such as price, name, product link, and image link is saved in JSON files in the ‘jsonFiles’ folder.
There are also utility scripts for tasks such as removing line items with null fields (‘jsonPrep.py’) and removing duplicate line items (‘deldub.py’).
After creating the project folders, the next step is to populate the CSV files with the starting URLs for every website we want to scrape.
Nearly every e-commerce site uses pagination to guide users through its product lists. Each time you move to the next page, a ‘page’ parameter in the URL increases.
We use a {} placeholder in place of the page number so that we can iterate over the URLs by incrementing the value of ‘page’. We also use a ‘gender’ column in the CSV file to define the gender category of each URL.
The final CSV file therefore contains one row per starting URL, with the gender category and the URL template holding the {} placeholder.
The same principles apply to the rest of the sites in the project.
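As a rough illustration of how such a CSV row turns into page URLs, here is a minimal sketch. The column names, the example.com URLs, and the `build_page_urls` helper are all assumptions made for illustration; they are not taken from the project's actual files.

```python
import csv
import io

# Hypothetical CSV content: the column names and the example.com URLs are
# assumptions for illustration, not the project's actual file.
CSV_TEXT = """gender,url
women,https://www.example.com/women-handbags?page={}
men,https://www.example.com/men-bags?page={}
"""


def build_page_urls(csv_text, max_pages):
    """Return (gender, url) pairs for pages 1..max_pages of each CSV row."""
    pages = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for page in range(1, max_pages + 1):
            # str.format fills the {} placeholder with the page number.
            pages.append((row["gender"], row["url"].format(page)))
    return pages


for gender, url in build_page_urls(CSV_TEXT, max_pages=2):
    print(gender, url)
```

Keeping the URL templates in a CSV file means a new category or seller can be added by editing data rather than spider code.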
Step 1: Installing and Setting Up packages
To do web scraping, we need to modify ‘items.py’ to define the ‘item objects’ that store the extracted data.
To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.
Source: scrapy.org
After that, we need to modify ‘settings.py’ to customize the image pipeline and the behavior of the spiders.
The Scrapy settings allow you to customize the behavior of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
Source: scrapy.org
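A settings fragment along these lines might look as follows. The setting names are standard Scrapy settings, but every value here is an assumption chosen for illustration, and the project's real ‘settings.py’ may differ.

```python
# settings.py (fragment). All values below are illustrative assumptions.

BOT_NAME = "fashionWebScraping"

# Respect robots.txt and pause between requests to avoid overloading sellers.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0  # seconds to wait between requests to the same site

# Enable Scrapy's built-in image pipeline and point it at the image folder.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "images_scraped"
```

By default, the built-in image pipeline reads the URLs to download from an `image_urls` field on each item, so the item definition has to provide that field for images to be saved.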
‘settings.py’ and ‘items.py’ apply to all the spiders in the project.
Scrapy spiders are classes that define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, spiders are the place where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites).
Source: scrapy.org
The shell command creates an empty spider file. The next step is to write the spider code in the fashionBOYNER.py file.
The spider class has two functions: ‘start_requests’ and ‘parse_product_pages’.
In ‘start_requests’, we read the starting URL data from the CSV file we created earlier. We then fill the {} placeholder with successive page numbers and pass the resulting product page URLs to the ‘parse_product_pages’ function.
We can also pass the ‘gender’ metadata to ‘parse_product_pages’ through the ‘Request’ method, using the argument meta={'gender': row['gender']}.
In ‘parse_product_pages’, we perform the actual extraction and populate the Scrapy items with the extracted data.
We use XPath to locate the HTML sections that contain product data on a web page.
The first XPath expression selects the entire product listing from the page currently being scraped. All the necessary product data is contained in ‘div’ elements, which we store in ‘content’.
We have to loop through ‘content’ to reach the individual products and store them in Scrapy items. Using XPath expressions, we can easily locate the required HTML elements within ‘content’.
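The select-then-loop pattern can be sketched without Scrapy. The example below uses the limited XPath support in Python's standard `xml.etree.ElementTree` as a stand-in for Scrapy's selectors; inside a spider, comparable expressions would be passed to `response.xpath`. The HTML snippet, class names, and field names are hypothetical and do not reflect any seller's real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing; real pages differ per seller.
PAGE = """
<html><body>
  <div class="product">
    <a class="name" href="/bag-1">Leather Tote</a>
    <span class="price">499.90</span>
    <img src="/img/bag-1.jpg" />
  </div>
  <div class="product">
    <a class="name" href="/bag-2">Canvas Backpack</a>
    <span class="price">259.00</span>
    <img src="/img/bag-2.jpg" />
  </div>
</body></html>
"""


def parse_products(page, gender):
    """Extract one dict per product container found on the page."""
    root = ET.fromstring(page.strip())
    # First expression: select every product container on the page.
    content = root.findall(".//div[@class='product']")
    items = []
    for product in content:
        # Relative expressions locate the fields inside one container.
        link = product.find("./a[@class='name']")
        items.append({
            "gender": gender,
            "name": link.text,
            "link": link.get("href"),
            "price": product.find("./span[@class='price']").text,
            "image": product.find("./img").get("src"),
        })
    return items


for item in parse_products(PAGE, gender="women"):
    print(item)
```

Selecting the containers first and then querying relative to each one keeps the fields of different products from being mixed up.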
During scraping, every product item is saved to a JSON file. Each website has its own JSON file, which is filled with data on every spider run.
The jsonlines format can be more memory-efficient than the JSON format, particularly if you scrape many web pages in one session.
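The memory argument can be seen in a small sketch: with one JSON object per line, items can be appended and read back one at a time instead of loading a whole array. The file name and the two helper functions are hypothetical.

```python
import json
import os
import tempfile


def append_item(path, item):
    """Append one item as a single JSON line; existing lines are not reloaded."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


def iter_items(path):
    """Yield items one at a time, so only one line is parsed at once."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical file name, written to a temporary folder for the demo.
path = os.path.join(tempfile.mkdtemp(), "rawdata_example.jsonl")
append_item(path, {"name": "Leather Tote", "price": "499.90"})
append_item(path, {"name": "Canvas Backpack", "price": "259.00"})
print([item["name"] for item in iter_items(path)])
```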
Note that the JSON file names begin with ‘rawdata’, indicating that the next step is to check and validate the raw data before using it in the application.
After the extraction ends, there may be line items you need to remove from the JSON files before using them in the application.
Some line items may have duplicate values or null fields. Both cases require a correction step, which we handle with ‘deldub.py’ and ‘jsonPrep.py’.
‘jsonPrep.py’ looks for line items with null values and removes them when detected.
After the null line items are removed, the results are saved in the ‘jsonFiles’ folder with a file name beginning with ‘prepdata’.
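A null-removal step of this kind could be sketched as follows. This is not the project's ‘jsonPrep.py’; the function name, the sample data, and the choice to treat empty strings and empty lists as null are assumptions.

```python
import json


def drop_null_items(lines):
    """Keep only JSON lines in which every field has a non-empty value."""
    kept = []
    for line in lines:
        item = json.loads(line)
        # Treat None, empty strings, and empty lists as missing values.
        if all(value not in (None, "", []) for value in item.values()):
            kept.append(item)
    return kept


# Hypothetical raw line items, one JSON object per line.
raw = [
    '{"name": "Leather Tote", "price": "499.90"}',
    '{"name": "Canvas Backpack", "price": null}',
    '{"name": "", "price": "120.00"}',
]
clean = drop_null_items(raw)
print(clean)
```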
‘deldub.py’ looks for duplicate line items and removes them when detected.
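Duplicate removal can be sketched in a similar way. Again, this is not the project's ‘deldub.py’; using the product link as the identity of a line item is an assumption, as are the function name and the sample data.

```python
def drop_duplicates(items, key="link"):
    """Return items with only the first occurrence of each key value kept."""
    seen = set()
    unique = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            unique.append(item)
    return unique


# Hypothetical line items; the second one repeats the first product link.
items = [
    {"name": "Leather Tote", "link": "/bag-1"},
    {"name": "Leather Tote", "link": "/bag-1"},
    {"name": "Canvas Backpack", "link": "/bag-2"},
]
print(drop_duplicates(items))
```

Tracking seen keys in a set keeps the pass linear in the number of line items and preserves their original order.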
Now that the scraping procedure is defined, we can move on to workflow automation. We use Apache Airflow, a Python-based workflow automation tool originally developed at Airbnb.
Apache Airflow is installed and configured through terminal commands.
In Airflow, a DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
For instance, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and that B can be restarted up to 5 times if it fails. It might also say that the workflow runs every night at 10 pm, but should not start until a certain date.
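The dependency part of this example can be illustrated without Airflow. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to compute a valid run order for the three tasks; it only demonstrates the ordering constraint, not Airflow's scheduling, timeouts, or retries.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "A": set(),
    "B": {"A"},  # B waits for A
    "C": set(),  # C has no dependencies and can run anytime
}

# static_order() yields the tasks in an order that respects the dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any order in which A comes before B is valid, and the position of C is unconstrained. A cycle such as A waiting for B while B waits for A would raise `graphlib.CycleError`, which is why the graph has to be acyclic.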
DAGs are defined in Python files and serve to organize the task flow. We do not define the actual tasks in the DAG file.
Let's create a DAG folder with an empty Python file and start defining the workflow in Python code.
Airflow provides many operators for describing the tasks in a DAG file.
We plan to use only ‘BashOperator’, since each of our tasks is completed by running Python scripts.
For this tutorial, we created a bash script for each task. You can find them in the GitHub repository.
To start a DAG workflow, we need to run the Airflow scheduler. It runs with the configuration specified in the ‘airflow.cfg’ file. The scheduler monitors all tasks in all DAGs located in the ‘dags’ folder and triggers a task's execution once its dependencies are met.
Once the Airflow scheduler is running, we can check the status of the tasks by visiting http://0.0.0.0:8080 in the browser. Airflow provides a user interface where we can view and monitor the scheduled DAGs.
We have shown the web scraping workflow from start to finish.
Hopefully, it helps you grasp the fundamentals of web scraping and workflow automation.
For more details, contact Actowiz Solutions. You can also reach us for all your mobile app scraping and web scraping services requirements.