
Table of Contents
1. Introduction to the BeautifulSoup4 library
1. Introduction
2. Download the module
3. Parsing library
2. Getting started
1. Basic operations
2. Object types
3. Searching the document tree
4. CSS selectors

Understand the Python crawler parser BeautifulSoup4 in one article

Jul 12, 2022 pm 04:56 PM

This article brings you relevant knowledge about Python, mainly covering the crawler parser BeautifulSoup4. Beautiful Soup is a Python library that can extract data from HTML or XML files; working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree. Let's take a look. I hope it will be helpful to everyone.



1. Introduction to the BeautifulSoup4 library

1. Introduction

Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree, and it can save you hours or even days of work.

BeautifulSoup4 converts the web page into a DOM tree:
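As a minimal illustration of that idea, here is a sketch that parses a tiny invented page into a tree and prints it back out; it uses Python's built-in html.parser so nothing beyond beautifulsoup4 itself needs installing:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# a tiny hypothetical page, parsed with the stdlib html.parser
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # text of the <title> node in the tree
print(soup.prettify())    # the whole DOM tree, indented one node per line
```

Once the document is a tree, every tag becomes an attribute-style handle (soup.title, soup.p, and so on), which is what the rest of this article builds on.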

2. Download the module

1. On Windows, press Win+R, type cmd, and press Enter to open a command prompt.


2. Install beautifulsoup4 by entering the corresponding pip command: pip install beautifulsoup4. If the package is already installed, pip reports the installed version; otherwise it downloads and installs it.

3. Import the package

from bs4 import BeautifulSoup
3. Parsing library

When parsing, BeautifulSoup actually relies on an underlying parser. Besides the HTML parser in the Python standard library, it also supports several third-party parsers (such as lxml):

Parser: Python standard library. Usage: BeautifulSoup(html, "html.parser")
  Advantages: built into Python, moderate speed, reasonably fault tolerant
  Disadvantages: poor fault tolerance in Python 2.7.3 and in versions before Python 3.2.2

Parser: lxml HTML parser. Usage: BeautifulSoup(html, "lxml")
  Advantages: fast, strongly fault tolerant
  Disadvantages: requires installing the C library

Parser: lxml XML parser. Usage: BeautifulSoup(html, "xml")
  Advantages: fast; the only parser that supports XML
  Disadvantages: requires installing the C library

Parser: html5lib. Usage: BeautifulSoup(html, "html5lib")
  Advantages: best fault tolerance; parses documents the way a browser does and produces valid HTML5
  Disadvantages: slow; does not rely on external extensions

For our purposes, the most commonly used parser is the lxml HTML parser, followed by html5lib.

2. Getting started

1. Basic operations

1. Reading an HTML string:

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel_heading">
        <h4>Hello</h4>
    </div>
    <div class="panel_body">
        <ul class="list" id="list-1" name="element">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element"><a href="https://www.baidu.com">百度官網(wǎng)</a></li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
# create the soup object
soup = BeautifulSoup(html, 'lxml')

2. Reading an HTML file

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('index.html'), 'lxml')
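For completeness, here is a self-contained sketch of the file-reading variant. It first writes a throwaway index.html to a temporary directory so it can run anywhere, and it uses html.parser plus an explicit encoding and a context manager (these details are assumptions for the sketch, not part of the article):

```python
import os
import tempfile

from bs4 import BeautifulSoup

# write a small page to disk so the example is runnable as-is
path = os.path.join(tempfile.gettempdir(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><body><h1>Hello</h1></body></html>")

# pass the open file object straight to BeautifulSoup
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.h1.string)
```

Passing an explicit encoding avoids surprises when the file's encoding differs from the platform default.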

3. Basic methods

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel_heading">
        <h4>Hello</h4>
    </div>
    <div class="panel_body">
        <ul class="list" id="list-1" name="element">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element"><a href="https://www.baidu.com">百度官網(wǎng)</a></li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
# create the soup object
soup = BeautifulSoup(html, 'lxml')
# pretty-print with indentation
print(soup.prettify())
# get the first h4 tag and everything in it
print(soup.h4)
# get that tag's name
print(soup.h4.name)
# get that tag's text content
print(soup.h4.string)
# get the first ul tag and everything in it
print(soup.ul)
# get the value of the first ul tag's id attribute
print(soup.ul["id"])
# get the first a tag and everything in it
print(soup.a)
# get all a tags and everything in them
print(soup.find_all("a"))
# get the tag with id="list-1"
print(soup.find(id="list-1"))
# iterate over all a tags and print each one's href value
for item in soup.find_all("a"):
    print(item.get("href"))
# iterate over all a tags and print each one's text
for item in soup.find_all("a"):
    print(item.get_text())

2. Object types

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
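A quick sketch surfaces all four types at once (the markup is invented for illustration, and html.parser is used so it runs as-is):

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<b><!--a comment-->bold text</b>", "html.parser")

print(type(soup).__name__)    # BeautifulSoup: the document itself
tag = soup.b
print(type(tag).__name__)     # Tag: the <b> element
comment, text = tag.contents  # the two children of <b>
print(type(comment).__name__) # Comment: the text inside <!-- -->
print(type(text).__name__)    # NavigableString: plain text
```

Note that Comment is a subclass of NavigableString, which is why the two often behave alike; the four kinds are described one by one below.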

(1) Tag: put simply, a Tag is one of the tags in the HTML document, for example:

soup = BeautifulSoup('<b>Extremely bold</b>', 'lxml')
tag = soup.b
print(tag)
print(type(tag))

Output:

<b>Extremely bold</b>
<class 'bs4.element.Tag'>

Tag has many methods and attributes, explained in detail under navigating the tree and searching the tree. For now, here are a tag's two most important attributes: name and attributes.

The name attribute:

print(tag.name)
# Output: b
# If you change a tag's name, the change is reflected in all HTML generated by the current Beautiful Soup object:
tag.name = "b1"
print(tag)
# Output: <b1>Extremely bold</b1>

The attributes:

# get the class attribute
print(tag['class'])

# or access the attributes directly as a dict via .attrs:
print(tag.attrs)

A tag's attributes can be added, modified, and deleted:

# add an id attribute
tag['id'] = 1

# modify the class attribute
tag['class'] = 'tl1'

# delete the class attribute
del tag['class']
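Putting the three operations together in one runnable sketch (html.parser assumed, markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["id"] = 1          # add a new attribute
tag["class"] = "tl1"   # overwrite an existing one
print(tag.attrs)       # both changes are visible in the attrs dict

del tag["class"]       # remove an attribute entirely
print("class" in tag.attrs)
```

Because attrs behaves like a dict, the familiar dict idioms (membership tests, .get() with a default) all work on tags too.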

(2) NavigableString: use .string to get the text inside a tag:

print(soup.b.string)
print(type(soup.b.string))

(3) BeautifulSoup: represents the document as a whole; you can get its type, name, and attributes:

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)
# {} (the document itself has no attributes)

(4) Comment: a special type of NavigableString object whose output does not include the comment delimiters.

# assuming the tag's content is a comment, e.g.
# soup = BeautifulSoup('<b><!--Hey, buddy--></b>', 'lxml')
print(soup.b)
print(soup.b.string)
print(type(soup.b.string))

3. Searching the document tree

1. find_all(name, attrs, recursive, text, **kwargs)

(1) The name parameter: finds all tags whose name matches the given value; string objects are automatically ignored.

  • Matching a string: finds content that matches the string exactly, e.g. all <a> tags in the document:

    a_list = soup.find_all("a")
    print(a_list)

  • Matching a regular expression: if you pass in a regular expression, Beautiful Soup matches tag names with the expression's search() method:

    # find all tags whose names start with "b"
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)

  • Matching a list: if you pass in a list, Beautiful Soup returns everything that matches any element of the list:

    # find all <p> tags and all <a> tags:
    soup.find_all(["p", "a"])

(2) The kwargs parameter: a keyword argument that is not one of the built-in parameter names is treated as a filter on the attribute of that name:

soup.find_all(id='link2')

(3) The text parameter: the text parameter searches the document's string content. Like name, it accepts a string, a regular expression, or a list (in newer versions of Beautiful Soup the same parameter is named string):

# match a string
soup.find_all(text="a")

# match a regular expression
soup.find_all(text=re.compile("^b"))

# match a list
soup.find_all(text=["p", "a"])
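The matching modes above can be tried against a small invented document; this sketch uses string= (the newer spelling of the text parameter) and html.parser:

```python
import re

from bs4 import BeautifulSoup

html = '<p id="p1">apple</p><p id="p2">banana</p><a href="#">ant</a>'
soup = BeautifulSoup(html, "html.parser")

# exact string match against the document's text nodes
print(soup.find_all(string="apple"))
# regular expression: text nodes starting with "b"
print(soup.find_all(string=re.compile("^b")))
# list of tag names: every <p> and <a> tag
print([t.name for t in soup.find_all(["p", "a"])])
# keyword argument: filter on any attribute, e.g. id
print(soup.find(id="p2").get_text())
```

Note that string-based searches return NavigableString objects (the text nodes themselves), while name-based searches return Tag objects.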

4. CSS selectors

When working with the BeautifulSoup parsing library, we often combine it with CSS selectors to extract data.

Note: the selectors discussed below only select tags; extracting attribute values and text content is covered afterwards.

1. Search by tag name: for example, "li" selects all li tags. We rarely use this on its own, because we usually narrow down to a specific tag before extracting data.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel_heading">
        <h4>Hello</h4>
    </div>
    <div class="panel_body">
        <ul class="list" id="list-1" name="element">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element"><a href="https://www.baidu.com">百度官網(wǎng)</a></li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
# create the soup object
soup = BeautifulSoup(html, 'lxml')
# 1. search by tag name: find all li tags
print(soup.select("li"))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element"><a href="https://www.baidu.com">百度官網(wǎng)</a></li>, <li class="element">Bar</li>]

2. Search by class name: ".line" (a dot followed by the name) selects all tags with class="line"; the dot stands for class:

print(soup.select(".panel_body"))

Output:

[<div class="panel_body">
<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element"><a href="https://www.baidu.com">百度官網(wǎng)</a></li>
<li class="element">Bar</li>
</ul>
</div>]

3. Search by id: "#box" (a hash followed by the name) selects all tags with id="box"; the hash stands for id:

print(soup.select("#list-1"))

Output:

[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

4. Search by attribute name. The class and id attributes are special cases with their own shorthand ("." and "#"); other attributes are matched with bracket syntax.

For example, input[name="username"] finds tags whose name attribute equals "username". Note how this differs from XPath syntax:

print(soup.select('ul[name="element"]'))

Output:

[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

5. Tag name combined with class or id:

# find the ul tag with id "list-1"
print(soup.select('ul#list-1'))
print("-"*20)
# find ul tags with class "list"
print(soup.select('ul.list'))

Output:

[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
--------------------
[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element"><a href="https://www.baidu.com">百度官網(wǎng)</a></li>
<li class="element">Bar</li>
</ul>]

6. Searching direct children:

# find li tags that are direct children of the tag with id="list-1"
print(soup.select('#list-1 > li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

7. Searching descendant tags:

# a space between .panel_body and li: this selects li tags that are children or deeper descendants of tags with class "panel_body"
print(soup.select('.panel_body li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element"><a href="https://www.baidu.com">百度官網(wǎng)</a></li>, <li class="element">Bar</li>]

8. Getting a tag's attribute:

# 1. first grab the .panel_body tag
p = soup.select(".panel_body")[0]
# 2. then read the href attribute of the a tag below it
print(p.select('a')[0]["href"])

Output:

https://www.baidu.com
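The selector forms above can be exercised together against a small stand-in document (shaped like this section's sample but invented here, and parsed with html.parser so it runs as-is):

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel_body">
  <ul class="list" id="list-1" name="element">
    <li>Foo</li><li>Bar</li><li>Jay</li>
  </ul>
  <ul class="list">
    <li><a href="https://www.baidu.com">link</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("li")))                      # tag name: all 4 li tags
print(len(soup.select("ul.list")))                 # tag + class: both ul tags
print(len(soup.select("#list-1 > li")))            # direct children of an id: 3
print(len(soup.select(".panel_body li")))          # descendants of a class: 4
print(soup.select('ul[name="element"]')[0]["id"])  # attribute selector
print(soup.select(".panel_body a")[0]["href"])     # reading an attribute
```

Since select() always returns a list of Tag objects, attribute access and further select() calls can be chained off any result, exactly as in step 8 above.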

9. There are four ways to get text content:

(a) string: gets the text of a single tag, with no nesting; returns a string:

# 1. first grab the .panel_body tag
p = soup.select(".panel_body")[0]
# 2. then take the a tag below it
print(p.select('a')[0].string)

Output:

百度官網(wǎng)

(b) strings: gets all the text under a tag, nesting included; returns a generator, which list() can turn into a list:

print(p.strings)
print(list(p.strings))

Output:

<generator object ...>
['\n', '\n', 'Foo', '\n', 'Bar', '\n', 'Jay', '\n', '\n', '\n', 'Foo', '\n', '百度官網(wǎng)', '\n', 'Bar', '\n', '\n']

(c) stripped_strings: like (b), except it strips the whitespace and newlines from the start and end of each string:

print(p.stripped_strings)
print(list(p.stripped_strings))

Output:

<generator object ...>
['Foo', 'Bar', 'Jay', 'Foo', '百度官網(wǎng)', 'Bar']

(d) get_text(): gets all the strings, nesting included, but joins them into a single string before returning it.
Note: the first three are attributes and take no parentheses; get_text() is a method and does.

print(p.get_text())

    輸出結(jié)果:

    Foo
    Bar
    Jay
    
    
    Foo
    百度官網(wǎng)
    Bar
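To see the four approaches side by side, here is a sketch on a small invented fragment (html.parser assumed); it also shows that .string is None when the tag's content is nested:

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>Foo</p><p><a href="#">Bar</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
box = soup.find(id="box")

print(soup.p.string)               # (a) single tag, no nesting: 'Foo'
print(box.string)                  # .string is None when content is nested
print(list(box.strings))           # (b) every text piece, as a list
print(list(box.stripped_strings))  # (c) same, with whitespace stripped
print(box.get_text())              # (d) everything joined into one string
```

In practice stripped_strings is the usual choice for list-like data, while get_text() suits a single blob of display text.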


To recap, the four parsers are selected like this:

BeautifulSoup(html, 'html.parser')
BeautifulSoup(html, 'lxml')
BeautifulSoup(html, 'xml')
BeautifulSoup(html, 'html5lib')
