Python web scraping

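In this article we show how to do web scraping in Python. We use several Python libraries.

Web scraping is fetching and extracting data from web pages. It is used to collect and process data for marketing or research. Typical data includes job listings, price comparisons, or social media posts.

Python is a popular choice for data science, and it has many libraries for web scraping. To fetch data, we can use the requests or urllib3 libraries. If we want to create asynchronous clients, we can use the httpx library.

To process the data, we can use lxml, pyquery, or BeautifulSoup. These libraries are suitable for static data. If the data is hidden behind a JavaScript wall, we can use the Selenium or Playwright libraries.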

Web scraping with urllib3 and lxml

In the first example, we fetch data with urllib3 and process it with lxml.

#!/usr/bin/python

import urllib3
from lxml import html

http = urllib3.PoolManager()

url = 'http://webcode.me'
resp = http.request('GET', url)

content = resp.data.decode('utf-8')
root = html.fromstring(content)

print('------------------------')

print(root.head.find(".//title").text)

print('------------------------')

for e in root:
    print(e.tag)

print('------------------------')

print(root.body.text_content().strip())

The program retrieves the HTML title, the top-level tags, and the text content of the HTML body.

http = urllib3.PoolManager()

A PoolManager is created. It handles all the details of connection pooling and thread safety.
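
PoolManager also accepts default timeout and retry settings, which can help when a site responds slowly or fails intermittently. The following is a minimal sketch; the concrete values are assumptions chosen for illustration and are not part of the original example.

```python
import urllib3

# Illustrative values only; tune them for the target site.
timeout = urllib3.Timeout(connect=2.0, read=5.0)
retries = urllib3.Retry(total=3, backoff_factor=0.5)

# Every request made through this manager inherits these defaults.
http = urllib3.PoolManager(timeout=timeout, retries=retries)

# Creating the manager opens no connection; traffic starts with http.request().
print(type(http).__name__)
```

Individual calls can still override the defaults by passing timeout or retries to http.request.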

url = 'http://webcode.me'
resp = http.request('GET', url)

We send a GET request to the specified URL.

content = resp.data.decode('utf-8')
root = html.fromstring(content)

We get the content and decode it. We parse the string to create an lxml HTML document.
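
The example hardcodes utf-8, which suits webcode.me but not every site. One option is to read the charset from the response's Content-Type header and fall back to utf-8 when none is given. The sketch below uses only the standard library; charset_from_content_type and decode_body are hypothetical helper names, and in the urllib3 example the header value would come from resp.headers.get('Content-Type', '').

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(data, content_type):
    """Decode raw response bytes using the charset the server declared."""
    charset = charset_from_content_type(content_type)
    try:
        return data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset label: fall back to lenient utf-8.
        return data.decode("utf-8", errors="replace")


print(charset_from_content_type("text/html; charset=ISO-8859-2"))
print(charset_from_content_type("text/html"))
```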

print(root.head.find(".//title").text)

We print the title of the document.

for e in root:
    print(e.tag)

Here we print all the tags from the first level of the document.

print(root.body.text_content().strip())

We print the text data of the HTML body.

$ ./main.py 
------------------------
My html page
------------------------
head
body
------------------------
Today is a beautiful day. We go swimming and fishing.
    
    
    
         Hello there. How are you?
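
The body text keeps the line breaks and indentation of the HTML source. When only the words matter, the whitespace can be collapsed with str.split and str.join. A small sketch, with the sample string copied from the output of the program:

```python
# Text as returned by text_content(), including the source indentation.
raw = (
    "Today is a beautiful day. We go swimming and fishing.\n"
    "    \n"
    "    \n"
    "    \n"
    "         Hello there. How are you?"
)

# split() without arguments splits on any run of whitespace, so joining
# with a single space collapses newlines and indentation.
clean = " ".join(raw.split())
print(clean)
```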

Web scraping with requests and pyquery

In the second example, we use the requests library to fetch data and pyquery to process it.

#!/usr/bin/python

from pyquery import PyQuery as pq
import requests as req

resp = req.get("http://www.webcode.me")
doc = pq(resp.text)

title = doc('title').text()
print(title)

pars = doc('p').text()
print(pars)

In the example, we get the title and the text of all p tags.

resp = req.get("http://www.webcode.me")
doc = pq(resp.text)

We send a GET request and create a parseable document object from the response.

title = doc('title').text()
print(title)

We get the title tag from the document and print its text.

$ ./main.py
My html page
Today is a beautiful day. We go swimming and fishing. Hello there. How are you?

Python scrape dictionary definitions

In the next example, we scrape the definition of a word from dictionary.com. We use the requests and lxml libraries.

#!/usr/bin/python

import requests as req
from lxml import html
import textwrap

term = "dog"

resp = req.get("http://www.dictionary.com/browse/" + term)
root = html.fromstring(resp.content)

for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:

        s = sel.text.strip()

        if (len(s) > 3):

            print(textwrap.fill(s, width=50))

The program gets the definition of the term dog.

import textwrap

The textwrap module is used to wrap text to a given width.
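
As a quick illustration of what textwrap.fill does, the sketch below wraps one long definition, taken from the program's sample output, and checks the length of the resulting lines.

```python
import textwrap

text = (
    "any carnivore of the dog family Canidae, having prominent "
    "canine teeth and, in the wild state, a long and slender muzzle"
)

# fill() breaks the text at word boundaries and returns a single string
# whose lines are joined with newline characters.
wrapped = textwrap.fill(text, width=50)
print(wrapped)

# No output line is longer than the requested width.
print(max(len(line) for line in wrapped.splitlines()) <= 50)
```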

resp = req.get("http://www.dictionary.com/browse/" + term)

To perform a search, we append the term to the end of the URL.
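
Plain concatenation works for a single word such as dog. A term that contains spaces or other reserved characters should be percent-encoded first. The sketch below uses urllib.parse.quote; build_url is a hypothetical helper that is not part of the original program, and how the site itself expects multi-word terms to be written is not covered here.

```python
from urllib.parse import quote

BASE = "http://www.dictionary.com/browse/"


def build_url(term):
    """Append a percent-encoded search term to the base URL."""
    # safe="" also encodes "/", so the term cannot add path segments.
    return BASE + quote(term, safe="")


print(build_url("dog"))
print(build_url("hot dog"))
```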

root = html.fromstring(resp.content)

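We pass resp.content rather than resp.text. resp.content holds the raw bytes of the response, whereas resp.text holds Unicode text that requests has already decoded. Given bytes, html.fromstring can determine the encoding from the document itself.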
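
The difference matters when the declared and the actual encoding disagree. requests decodes resp.text with the charset from the Content-Type header and, for text responses that declare no charset, falls back to ISO-8859-1. The standard-library sketch below shows what happens when UTF-8 bytes are decoded that way; the sample word is an assumption for illustration.

```python
raw = "Café".encode("utf-8")       # bytes, as resp.content would deliver them

wrong = raw.decode("iso-8859-1")   # decoded with the wrong charset: 'CafÃ©'
right = raw.decode("utf-8")        # decoded with the correct charset: 'Café'

# The wrong decoding turns the two-byte "é" into two separate characters.
print(len(wrong), len(right))
```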

for sel in root.xpath("//span[contains(@class, 'one-click-content')]"):

    if sel.text:

        s = sel.text.strip()

        if (len(s) > 3):

            print(textwrap.fill(s, width=50))

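We parse the content. The main definitions are located inside span tags whose class attribute contains one-click-content. We improve the formatting by removing excess white space and skipping stray fragments of three characters or fewer. The text is wrapped to a maximum width of 50 characters. Note that this kind of parsing depends on the site's markup, which may change at any time.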
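
The XPath function contains performs a substring test, so it also matches class names that merely include the text one-click-content. A stricter variant pads the attribute with spaces and looks for the whole class token. The sketch below compares the two on invented markup; the real dictionary.com page is not reproduced here.

```python
from lxml import html

# Invented sample markup, used only to compare the two expressions.
SAMPLE = """
<html><body>
  <span class="one-click-content css-1">a domesticated canid</span>
  <span class="one-click-content-old">legacy text</span>
  <span class="label">noun</span>
</body></html>
"""

root = html.fromstring(SAMPLE)

# Substring match, as in the example: also hits "one-click-content-old".
loose = root.xpath("//span[contains(@class, 'one-click-content')]")

# Token match: pad the attribute with spaces and look for the whole word.
strict = root.xpath(
    "//span[contains(concat(' ', normalize-space(@class), ' '), "
    "' one-click-content ')]"
)

print([e.text for e in loose])
print([e.text for e in strict])
```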

$ ./get_term.py
a domesticated canid,
any carnivore of the dog family Canidae, having
prominent canine teeth and, in the wild state, a
long and slender muzzle, a deep-chested muscular
body, a bushy tail, and large, erect ears.
...

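Python web scraping with BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It is one of the most powerful web scraping solutions.

BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as tags, navigable strings, or comments.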
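
To make the object tree concrete, the sketch below parses an invented fragment and prints the type of each child of the p tag. It uses html.parser, Python's built-in parser, so that it needs no extra parser package; this is a choice made for the illustration only.

```python
from bs4 import BeautifulSoup

# Invented fragment containing text, a comment, and a nested tag.
soup = BeautifulSoup("<p>Hello<!-- a note --><b>world</b></p>", "html.parser")

for node in soup.p.contents:
    # Each child is a NavigableString, a Comment, or a Tag.
    print(type(node).__name__)
```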

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

In the example, we get the title tag, the title text, and the parent of the title tag. To fetch the web page, we use the requests library.

soup = BeautifulSoup(resp.text, 'lxml')

A BeautifulSoup object is created; the HTML data is passed to the constructor. The second argument specifies the internal parser.

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

We get the data using built-in attributes.

$ ./main.py
<title>My html page</title>
My html page
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="format.css" rel="stylesheet"/>
<title>My html page</title>
</head>

Python scrape top 5 countries

In the next example, we extract the five most populous countries.

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me/countries.html')

soup = BeautifulSoup(resp.text, 'lxml')

data = soup.select('tbody tr:nth-child(-n+5)')

for row in data:
    print(row.text.strip().replace('\n', ' '))

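To extract the data, we use the select method, which performs CSS selection. The tr:nth-child(-n+5) selector matches the first five rows of the table body.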

$ ./top_countries.py 
1 China 1382050000
2 India 1313210000
3 USA 324666000
4 Indonesia 260581000
5 Brazil 207221000
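
Instead of printing each row as one string, the cells can be read separately and converted to typed values. The sketch below runs on an invented table in the shape suggested by the output; the real markup of countries.html may differ, and html.parser is used only to keep the sketch free of extra parser packages.

```python
from bs4 import BeautifulSoup

# Invented sample table; the values are copied from the program output.
SAMPLE = """
<table>
  <tbody>
    <tr><td>1</td><td>China</td><td>1382050000</td></tr>
    <tr><td>2</td><td>India</td><td>1313210000</td></tr>
    <tr><td>3</td><td>USA</td><td>324666000</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

countries = []
# :nth-child(-n+2) keeps only the first two rows of the table body.
for row in soup.select("tbody tr:nth-child(-n+2)"):
    rank, name, population = (td.get_text(strip=True) for td in row.select("td"))
    countries.append((int(rank), name, int(population)))

print(countries)
```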

Python scrape dynamic content

We can scrape dynamic content with Playwright or Selenium. In our example, we use the Playwright library.

$ pip install --upgrade pip
$ pip install playwright
$ playwright install

We install the Playwright library and the browser binaries it drives.

#!/usr/bin/python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:

    browser = p.chromium.launch()

    page = browser.new_page()
    page.goto("http://webcode.me/click.html")

    page.click('button', button='left')
    print(page.query_selector('#output').text_content())

    browser.close()

There is a single button on the web page. When we click the button, a text message appears in the output div tag.

with sync_playwright() as p:

We work in synchronous mode.

browser = p.chromium.launch()

We use the Chromium browser. By default, launch starts it in headless mode.

page = browser.new_page()
page.goto("http://webcode.me/click.html")

We navigate to the page.

page.click('button', button='left')

We click the button.

print(page.query_selector('#output').text_content())

We retrieve the message.

$ ./main.py 
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/109.0.5414.46 Safari/537.36

In this article, we have shown how to do web scraping in Python.
