Every Detail: A Friendly Guide to Everything About BeautifulSoup

Author: 派森醬 2021-10-05 21:03:54

BeautifulSoup uses the NavigableString class to wrap the strings inside a tag; a NavigableString is a string that can be navigated.


A Closer Look at Web Scraping with BeautifulSoup

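The previous article, the first part of this introduction to scraping with BeautifulSoup, covered installing BeautifulSoup, gave a brief overview, and quickly walked through locating tags and reading their contents. This article goes deeper into BeautifulSoup's syntax and usage.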

1. BeautifulSoup Objects

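BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. The official BeautifulSoup documentation groups these objects into four kinds: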

Tag
NavigableString
BeautifulSoup
Comment

The four kinds of object are described in detail below:

Tag

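A Tag object represents a tag in an XML or HTML document. Put simply, it is one of the individual tags in the HTML, and it corresponds to the tag in the original HTML or XML document. Tag has many methods and attributes. In BeautifulSoup a tag is accessed as soup.Tag, where Tag stands for the name of an HTML tag such as a or title; the result is the first matching tag in full, including its attributes and content. The following are examples of tags: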

<title>BeautifulSoup 技術詳解</title> 
<p class="title">Hello</p> 
<p class="con">Python 技術</p>

In the HTML above, title and p are tags; a start tag and an end tag with content between them make up a Tag. The following code gets a tag:

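from bs4 import BeautifulSoup

# Build a soup object from a local file
with open('test.html', 'rb') as f:
    soup = BeautifulSoup(f, "html.parser")

# Get the first <a> tag (a Tag object)
a = soup.a
print('The a tag is:', a)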
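As a small sketch of the point that soup.Tag returns only the first matching tag, the snippet below parses an inline string instead of test.html. The inline markup is an assumption made so the example runs on its own; it is a shortened copy of the sample paragraphs.

```python
from bs4 import BeautifulSoup

# Inline markup (an assumption for illustration, so no test.html is needed)
html = '<p class="title">Hello</p><p class="con">Python</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p   # only the first <p> is returned
missing = soup.a   # there is no <a> in this snippet, so this is None

print(first_p)
print(missing)
```

Because a missing tag gives None rather than raising an error, it is worth checking the result before reading attributes from it.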

Beyond that, the most important attributes of a Tag are name and attrs.

name

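The name attribute returns the name of a tag in the document tree. To get the name of the title tag, use soup.title.name. For any tag inside the document, the value is simply the tag's own name.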
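A minimal sketch of reading name, and of what happens when it is reassigned. The inline markup here is made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline markup chosen for illustration only
soup = BeautifulSoup("<title>Demo</title><p>Hello</p>", "html.parser")

title_name = soup.title.name   # the tag's own name
print(title_name)

# Assigning to .name renames the tag in the generated markup
tag = soup.p
tag.name = "div"
print(soup)
```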

attrs

attrs is short for attributes, which are an important part of a web page's tags. A single tag (Tag) may have many attributes, for example:

<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>

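This example has three attributes: href, whose value is "https://www.baidu.com"; class, whose value is "xiaodu"; and id, whose value is "l1". A Tag's attributes are accessed the same way as a Python dictionary. The code below first gets all the attributes of the p tag as a dictionary (soup.p is the first p paragraph, so these are its attribute names and values), then reads individual attribute values from the a tag.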

# Get all attributes of the first <p> tag as a dictionary
print(soup.p.attrs)
# {'class': ['title']}

# Get a single attribute value
print(soup.a['class'])
# ['xiaodu']
print(soup.a.get('id'))
# l1

Each tag in BeautifulSoup may have many attributes, which are available through .attrs. A tag's attributes can be modified, deleted, or added.
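Modifying, adding, and deleting attributes might look like the sketch below. It reuses the sample a tag as an inline string; the new values ("link1", "_blank") are made up for illustration.

```python
from bs4 import BeautifulSoup

# The sample <a> tag, parsed from an inline string
html = '<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

a['id'] = 'link1'        # modify an existing attribute (hypothetical value)
a['target'] = '_blank'   # add a new attribute (hypothetical value)
del a['class']           # delete an attribute

print(a.attrs)
print(a.get('class'))    # .get() returns None for a missing attribute
```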

NavigableString

NavigableString means a navigable string. Strings are usually contained inside tags, and BeautifulSoup uses the NavigableString class to wrap them. A NavigableString behaves like a Python Unicode string, and it also supports some of the features used for navigating and searching the document tree. The following code shows the type of a NavigableString.

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
print(type(tag.string))

The output is:

<class 'bs4.element.NavigableString'>
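To illustrate how a NavigableString is both a string and a node in the tree, here is a short sketch on made-up inline markup.

```python
from bs4 import BeautifulSoup, NavigableString

# Inline markup chosen for illustration only
soup = BeautifulSoup("<title>Demo page</title>", "html.parser")
text = soup.title.string

# A NavigableString is a str subclass, so ordinary string methods work
print(isinstance(text, str))
print(text.upper())

# It also knows where it sits in the tree
print(text.parent.name)

# str() gives a plain Python string with no link back to the tree
plain = str(text)
print(type(plain))
```

Converting with str() can be useful when the text needs to outlive the parsed document.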

BeautifulSoup

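The BeautifulSoup object represents the whole content of a document. Most of the time it can be treated as a Tag object, and it supports most of the methods described for navigating and searching the document tree. The code below prints the type of the soup object, which is the BeautifulSoup class.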

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(type(soup))

The output is:

<class 'bs4.BeautifulSoup'>

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attributes of its own. It is still sometimes convenient to look at its .name, so the BeautifulSoup object is given a special soup.name whose value is [document].
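A minimal sketch that prints the name of the BeautifulSoup object, using made-up inline markup rather than test.html:

```python
from bs4 import BeautifulSoup

# Inline markup chosen for illustration only
soup = BeautifulSoup("<p>Hello</p>", "html.parser")

doc_name = soup.name
print(doc_name)
```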

Comment

A Comment object is a special type of NavigableString used for comments in the markup. The following example reads the content of a comment:

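# coding=utf-8
from bs4 import BeautifulSoup

def mark():
    # Parse a snippet whose <b> tag contains only a comment
    markup = "<b><!-- hello comment code --></b>"
    soup = BeautifulSoup(markup, "html.parser")
    comment = soup.b.string
    print(type(soup))
    print(type(comment))
    print(comment)

if __name__ == '__main__':
    mark()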

The output is:

<class 'bs4.BeautifulSoup'> 
<class 'bs4.element.Comment'> 
 hello comment code
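Since Comment is a subclass of NavigableString, one common use is telling comments apart from ordinary text. The sketch below collects the comments with an isinstance check and removes them; the markup is made up, and this is one possible approach rather than the only one.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Inline markup chosen for illustration only
markup = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

print(issubclass(Comment, NavigableString))

# Collect the comments first, then remove them from the tree
comments = [node for node in soup.descendants if isinstance(node, Comment)]
for c in comments:
    c.extract()

print(comments)
print(soup)
```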

2. Navigating the Document Tree

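With the four objects covered, the following sections deal with navigating the document tree, searching the document tree, and commonly used BeautifulSoup functions. In BeautifulSoup, a tag (Tag) may contain several strings or other tags; these are called the tag's children.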

We will keep using the following HTML document for the explanations:

<!DOCTYPE html> 
<html lang="en"> 
<head> 
    <title>BeautifulSoup 技術詳解</title> 
</head> 
<body> 
<p class="title">Hello</p> 
<p class="con">Python 技術</p> 
 
<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a> 
 
</body> 
</html>

Child nodes

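A Tag may contain several strings or other Tags, all of which are its child nodes. Beautiful Soup provides many attributes for working with and iterating over child nodes.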

For example, to get the child nodes of a tag:

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(soup.head.contents)

The output is:

['\n', <title>BeautifulSoup 技術詳解</title>, '\n']

Note: string nodes in Beautiful Soup do not support these attributes, because a string has no child nodes.
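Besides .contents, a tag's children can also be iterated with .children (direct children only) and .descendants (children, grandchildren, and so on). A minimal sketch on made-up inline markup:

```python
from bs4 import BeautifulSoup

# Inline markup chosen for illustration only
soup = BeautifulSoup("<head><title>Demo</title></head>", "html.parser")
head = soup.head

direct = list(head.children)      # direct children only
nested = list(head.descendants)   # every node below <head>, in document order

print(direct)
print(nested)
```

Here .descendants yields both the title tag and the string inside it, while .children stops at the title tag.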

Node content

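If a tag has only one child node and you need that child's content, use the string attribute. When a tag has more than one child, string returns None, as head does in the code below: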

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(soup.head.string) 
 
print(soup.title.string)

The output is:

None 
BeautifulSoup 技術詳解
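When .string is None because a tag has several children, the text can still be collected with .stripped_strings or get_text(). The sketch below uses made-up inline markup that mimics the head element with its surrounding newlines.

```python
from bs4 import BeautifulSoup

# <head> has three children here: '\n', <title>, '\n'
html = "<head>\n<title>Demo</title>\n</head>"
soup = BeautifulSoup(html, "html.parser")

head_string = soup.head.string               # None, more than one child
pieces = list(soup.head.stripped_strings)    # non-empty strings, whitespace stripped
text = soup.head.get_text(strip=True)        # all text joined into one string

print(head_string)
print(pieces)
print(text)
```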

Parent node

Use the parent attribute to get a node's parent; to get the parent's tag name, use parent.name. For example:

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
p = soup.p 
print(p.parent) 
print(p.parent.name) 
 
content = soup.head.title.string 
print(content.parent) 
print(content.parent.name)

The output is:

<body> 
<p class="title">Hello</p> 
<p class="con">Python 技術</p> 
<a class="xiaodu" href="https://www.baidu.com" id="l1">ddd</a> 
</body> 
body 
<title>BeautifulSoup 技術詳解</title> 
title
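To walk all the way up the tree instead of taking a single step, .parents iterates over every ancestor. A small sketch on made-up inline markup:

```python
from bs4 import BeautifulSoup

# Inline markup chosen for illustration only
html = "<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Names of every ancestor of <p>, from nearest to the document root
names = [parent.name for parent in soup.p.parents]
print(names)
```

The last entry is the BeautifulSoup object itself, whose name is [document].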

Sibling nodes

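Sibling nodes are nodes on the same level as the current node. The next_sibling attribute returns the node's next sibling, and previous_sibling returns the one before it. If no such node exists, the result is None.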

print(soup.p.next_sibling)
print(soup.p.previous_sibling)
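One thing that often surprises people: in markup with line breaks between tags, the nearest sibling is usually a whitespace string, not the next tag. The sketch below, on a shortened inline copy of the sample body, compares next_sibling with find_next_sibling(), which skips to the next matching tag.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the sample body, with newlines between tags
html = '<body>\n<p class="title">Hello</p>\n<p class="con">Python</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
first = soup.p

raw_next = first.next_sibling              # the newline string between the tags
next_tag = first.find_next_sibling("p")    # the next <p> tag
prev_tag = first.find_previous_sibling("p")  # None, there is no earlier <p>

print(repr(raw_next))
print(next_tag)
print(prev_tag)
```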

Next and previous elements

The next_element attribute returns the next node in the order the document was parsed, and previous_element returns the previous one. For example:

print(soup.p.next_element) 
print(soup.p.previous_element)
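The difference between next_element and next_sibling is easiest to see side by side. In this sketch (made-up inline markup with no whitespace between the tags), the next element of the first p is the string inside it, while its next sibling is the second p.

```python
from bs4 import BeautifulSoup

# Two adjacent paragraphs, no whitespace between them
html = '<p class="title">Hello</p><p class="con">Python</p>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

print(repr(first.next_element))       # the string parsed right after the <p> start tag
print(first.next_sibling)             # the second <p>, on the same level
print(repr(second.previous_element))  # the string parsed just before the second <p>
print(second.previous_sibling)        # the first <p>
```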

3. Searching the Document Tree

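BeautifulSoup defines many search methods, such as find() and find_all(), of which find_all() is the most commonly used. The other methods mirror the ways of navigating the tree, covering parents, children, siblings, and so on. The following code uses find_all():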

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
urls = soup.find_all('p') 
for u in urls: 
    print(u)

The output is:

<p class="title">Hello</p> 
<p class="con">Python 技術</p>

find_all() returns every element in the document that matches what you are looking for.
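As a sketch of how find() and find_all() differ, and of filtering by attribute, the example below runs on an inline copy of the sample tags. The search for a table tag is only there to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample tags
html = '''<p class="title">Hello</p>
<p class="con">Python</p>
<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>'''
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                     # first match only, like soup.p
con_ps = soup.find_all("p", class_="con")    # filter by CSS class
link = soup.find(id="l1")                    # filter by id attribute

no_tag = soup.find("table")                  # find() gives None when nothing matches
no_tags = soup.find_all("table")             # find_all() gives an empty list

print(first_p)
print(con_ps)
print(link["href"])
print(no_tag, list(no_tags))
```

The keyword is spelled class_ because class is a reserved word in Python.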

Summary

That wraps up the BeautifulSoup basics and usage as far as I understand them. Please forgive any mistakes, and let's keep improving together.


Editor: 武曉燕  Source: Python技術
