Building an Async E-Commerce Web Scraper with Pydantic, Crawl4AI & Gemini

Jan 12, 2025 am 06:25 AM

In short: This guide demonstrates building an e-commerce scraper using crawl4ai's AI-powered extraction and Pydantic data models. The scraper asynchronously retrieves both product listings (names, prices) and detailed product information (specifications, reviews).

Access the complete code on Google Colab


Tired of the complexities of traditional web scraping for e-commerce data analysis? This tutorial simplifies the process using modern Python tools. We'll leverage crawl4ai for intelligent data extraction and Pydantic for robust data modeling and validation.

Why Choose Crawl4AI and Pydantic?

  • crawl4ai: Streamlines web crawling and scraping with LLM-driven extraction.
  • Pydantic: Validates data against a declared schema, so scraped records arrive with the expected fields and types.

Why Target Tokopedia?

Tokopedia, a major Indonesian e-commerce platform, serves as our example. (Note: The author is Indonesian and a user of the platform, but not affiliated.) The principles apply to other e-commerce sites. This scraping approach is beneficial for developers interested in e-commerce analytics, market research, or automated data collection.

What Sets This Approach Apart?

Instead of relying on complex CSS selectors or XPath, we utilize crawl4ai's LLM-based extraction. This offers:

  • Enhanced resilience to website structure changes.
  • Cleaner, more structured data output.
  • Reduced maintenance overhead.

Setting Up Your Development Environment

Begin by installing the necessary packages:

%pip install -U crawl4ai
%pip install nest_asyncio
%pip install pydantic

Notebooks already run an event loop, so we apply nest_asyncio to allow asyncio calls to run inside it:

import crawl4ai
import asyncio
import nest_asyncio
nest_asyncio.apply()

Defining Data Models with Pydantic

We use Pydantic to define the expected data structure. Here are the models:

from pydantic import BaseModel, Field
from typing import List, Optional

class TokopediaListingItem(BaseModel):
    product_name: str = Field(..., description="Product name from listing.")
    product_url: str = Field(..., description="URL to product detail page.")
    price: Optional[str] = Field(None, description="Price displayed in listing.")
    store_name: Optional[str] = Field(None, description="Store name from listing.")
    rating: Optional[str] = Field(None, description="Rating (1-5 scale) from listing.")
    image_url: Optional[str] = Field(None, description="Primary image URL from listing.")

class TokopediaProductDetail(BaseModel):
    product_name: str = Field(..., description="Product name from detail page.")
    all_images: List[str] = Field(default_factory=list, description="List of all product image URLs.")
    specs: Optional[str] = Field(None, description="Technical specifications or short info.")
    description: Optional[str] = Field(None, description="Long product description.")
    variants: List[str] = Field(default_factory=list, description="List of variants or color options.")
    satisfaction_percentage: Optional[str] = Field(None, description="Customer satisfaction percentage.")
    total_ratings: Optional[str] = Field(None, description="Total number of ratings.")
    total_reviews: Optional[str] = Field(None, description="Total number of reviews.")
    stock: Optional[str] = Field(None, description="Stock availability.")

These models serve as templates: Pydantic validates each record against them, and the field descriptions document what every value means.
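To illustrate what that validation buys you, here is a minimal sketch that checks a few records against a trimmed-down stand-in for the listing model. The `ListingItem` class, the `parse_items` helper, and the sample records are hypothetical and exist only for this example; they are not part of the original notebook.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class ListingItem(BaseModel):
    # Trimmed-down, hypothetical stand-in for TokopediaListingItem.
    product_name: str = Field(..., description="Product name from listing.")
    product_url: str = Field(..., description="URL to product detail page.")
    price: Optional[str] = Field(None, description="Price displayed in listing.")


def parse_items(raw_items: List[dict]) -> List[ListingItem]:
    """Keep the records that validate and report the ones that do not."""
    valid = []
    for raw in raw_items:
        try:
            valid.append(ListingItem(**raw))
        except ValidationError as exc:
            print(f"Skipping invalid record {raw!r}: {len(exc.errors())} error(s)")
    return valid


# Made-up extraction output: the second record lacks the required product_url.
sample = [
    {"product_name": "Wireless Mouse A", "product_url": "https://example.com/a", "price": "Rp100.000"},
    {"product_name": "Wireless Mouse B"},
    {"product_name": "Wireless Mouse C", "product_url": "https://example.com/c"},
]

items = parse_items(sample)
print([item.product_name for item in items])
```

Records missing a required field are rejected instead of silently flowing into later stages, while optional fields that are absent simply default to None.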

The Scraping Process

The scraper operates in two phases:

1. Crawling Product Listings

First, we retrieve search results pages:

async def crawl_tokopedia_listings(query: str = "mouse-wireless", max_pages: int = 1):
    # ... (Code remains the same) ...

2. Fetching Product Details

Next, for each product URL, we fetch detailed information:

async def crawl_tokopedia_detail(product_url: str):
    # ... (Code remains the same) ...

Combining the Stages

Finally, we integrate both phases:

async def run_full_scrape(query="mouse-wireless", max_pages=2, limit=15):
    # ... (Code remains the same) ...
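The function bodies are elided here, so the following is only a sketch of how a two-phase pipeline of this shape could be wired together with plain asyncio. `fetch_listings` and `fetch_detail` are hypothetical stubs that return fake data instead of calling crawl4ai, and the `concurrency` parameter is an assumption added for illustration.

```python
import asyncio
from typing import Dict, List


async def fetch_listings(query: str, page: int) -> List[Dict[str, str]]:
    # Hypothetical stub for the listing phase: returns two fake items per page.
    await asyncio.sleep(0.01)
    return [
        {
            "product_name": f"{query} item {page}-{i}",
            "product_url": f"https://example.com/{query}/{page}/{i}",
        }
        for i in range(2)
    ]


async def fetch_detail(product_url: str) -> Dict[str, str]:
    # Hypothetical stub for the detail phase.
    await asyncio.sleep(0.01)
    return {"description": f"Details for {product_url}"}


async def scrape(query: str, max_pages: int = 2, limit: int = 3,
                 concurrency: int = 2) -> List[Dict[str, str]]:
    # Phase 1: collect listing items page by page.
    listings: List[Dict[str, str]] = []
    for page in range(1, max_pages + 1):
        listings.extend(await fetch_listings(query, page))
    listings = listings[:limit]  # cap the number of detail requests

    # Phase 2: fetch details concurrently, bounded by a semaphore.
    semaphore = asyncio.Semaphore(concurrency)

    async def enrich(item: Dict[str, str]) -> Dict[str, str]:
        async with semaphore:
            detail = await fetch_detail(item["product_url"])
        return {**item, **detail}

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(enrich(item) for item in listings))


results = asyncio.run(scrape("mouse-wireless"))
print(len(results), results[0]["product_name"])
```

The semaphore keeps the number of simultaneous detail requests small, and `limit` caps the total work. Inside a notebook, where an event loop is already running, either apply nest_asyncio first or replace `asyncio.run(...)` with `await scrape(...)`.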

Running the Scraper

Here's how to execute the scraper:

# Run both phases; nest_asyncio.apply() lets asyncio.run() work inside a notebook.
results = asyncio.run(run_full_scrape("mouse-wireless", max_pages=2, limit=15))

Pro Tips

  1. Rate Limiting: Respect Tokopedia's servers; introduce delays between requests for large-scale scraping.
  2. Caching: Enable crawl4ai's caching during development (cache_mode=CacheMode.ENABLED).
  3. Error Handling: Implement comprehensive error handling and retry mechanisms for production use.
  4. API Keys: Store Gemini API keys securely in environment variables, not directly in the code.
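The rate-limiting, error-handling, and API-key tips can be sketched with the standard library alone. Everything below is illustrative: `fetch_with_retry`, `polite_crawl`, and the simulated `flaky_fetch` are hypothetical helpers, and `GEMINI_API_KEY` is an assumed environment variable name rather than one confirmed by the notebook.

```python
import asyncio
import os
import random


async def fetch_with_retry(fetch, url, retries=3, base_delay=0.01):
    """Call fetch(url), retrying with exponential backoff plus jitter."""
    for attempt in range(1, retries + 1):
        try:
            return await fetch(url)
        except Exception as exc:  # broad on purpose for the sketch; narrow it in real code
            if attempt == retries:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.3f}s")
            await asyncio.sleep(delay)


async def polite_crawl(fetch, urls, pause=0.01):
    """Fetch URLs one at a time, pausing between requests."""
    results = []
    for url in urls:
        results.append(await fetch_with_retry(fetch, url))
        await asyncio.sleep(pause)
    return results


# Simulated fetcher that fails the first time it sees each URL.
seen = set()


async def flaky_fetch(url):
    if url not in seen:
        seen.add(url)
        raise ConnectionError("simulated timeout")
    return f"ok:{url}"


# Read the key from the environment instead of hard-coding it (variable name assumed).
api_key = os.environ.get("GEMINI_API_KEY")
if api_key is None:
    print("GEMINI_API_KEY is not set; a real LLM extraction call would need it")

pages = asyncio.run(
    polite_crawl(flaky_fetch, ["https://example.com/1", "https://example.com/2"])
)
print(pages)
```

For real crawls the pause and base delay would be seconds rather than hundredths of a second; the small values here only keep the demonstration fast.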

Next Steps

This scraper can be extended to:

  • Store data in a database.
  • Monitor price changes over time.
  • Analyze product trends and patterns.
  • Compare prices across multiple stores.
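As one possible starting point for the first two ideas, the sketch below stores price snapshots in SQLite and reads back the history for a product. The table layout, helper names, and sample values are assumptions made for illustration, and prices are kept as strings to match the models above.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with a price snapshot table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS price_snapshots (
               product_url TEXT NOT NULL,
               price TEXT,
               scraped_at TEXT NOT NULL
           )"""
    )
    return conn


def save_snapshot(conn, product_url, price, scraped_at):
    """Record the price seen for a product at a given time."""
    conn.execute(
        "INSERT INTO price_snapshots (product_url, price, scraped_at) VALUES (?, ?, ?)",
        (product_url, price, scraped_at),
    )
    conn.commit()


def price_history(conn, product_url):
    """Return (scraped_at, price) rows for one product, oldest first."""
    rows = conn.execute(
        "SELECT scraped_at, price FROM price_snapshots "
        "WHERE product_url = ? ORDER BY scraped_at",
        (product_url,),
    )
    return rows.fetchall()


conn = open_store()
save_snapshot(conn, "https://example.com/a", "Rp100.000", "2025-01-12T06:00:00")
save_snapshot(conn, "https://example.com/a", "Rp95.000", "2025-01-13T06:00:00")

history = price_history(conn, "https://example.com/a")
changed = history[0][1] != history[-1][1]
print(history, changed)
```

ISO 8601 timestamps sort correctly as plain text, which is why `ORDER BY scraped_at` works without a dedicated date type. Comparing prices numerically would additionally require parsing the price strings.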

Conclusion

crawl4ai's LLM-based extraction makes a scraper easier to maintain than one built on hand-written CSS selectors or XPath, because it depends less on the exact page markup. Pydantic validation keeps the extracted data in a consistent, typed structure.

Always adhere to a website's robots.txt and terms of service before scraping.


Important Links:

  • Crawl4AI
  • Pydantic


Note: The complete code is available in the Colab notebook. Feel free to experiment and adapt it to your specific needs.
