How to do web scraping using lxml?

Asked: 2020-10-22 06:50:11

Tags: python web-scraping lxml.html

I want to write a Python script that fetches my current reputation on Stack Overflow - https://stackoverflow.com/users/14483205/raunanza?tab=profile

Here is the code I have written.

from lxml import html 
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content) 

Now, what should I do to get my reputation? (I don't even know how to use
xpath, even after googling.)

3 answers:

Answer 0: (score: 0)

A simple solution using lxml and BeautifulSoup:

from lxml import html
from bs4 import BeautifulSoup
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile').text
tree = BeautifulSoup(page, 'lxml')
name = tree.find("div", {'class': 'grid--cell fw-bold'}).text
title = tree.find("div", {'class': 'grid--cell fs-title fc-dark'}).text
print("Stackoverflow reputation of {} is: {}".format(name, title))
# output: Stackoverflow reputation of Raunanza is: 3

Answer 1: (score: 0)

If you don't mind using BeautifulSoup, you can extract the text directly from the tag that contains your reputation. Of course, you need to inspect the page structure first.

from bs4 import BeautifulSoup
import requests

page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
soup = BeautifulSoup(page.content, features='lxml')

for tag in soup.find_all('strong', {'class': 'ml6 fc-medium'}):
    print(tag.text)
# this will output 3
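Since the class names on the live page can change, here is a minimal sketch of the same extraction run against a static HTML snippet; the snippet is a hypothetical stand-in for the reputation element's markup, so the approach can be tested without hitting the live site:

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet standing in for the profile page's
# reputation markup (assumed structure, not the real page source)
snippet = '<div><strong class="ml6 fc-medium">3</strong></div>'

# html.parser is used here so the example runs without the lxml dependency
soup = BeautifulSoup(snippet, 'html.parser')
values = [tag.text for tag in soup.find_all('strong', {'class': 'ml6 fc-medium'})]
print(values)  # ['3']
```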

Answer 2: (score: 0)

You need a few modifications in your code to make the xpath work. Here is the code:

from lxml import html
import requests

page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content) 
title = tree.xpath('//*[@id="avatar-card"]/div[2]/div/div[1]/text()')
print(title) #prints 3

You can easily get an element's xpath from the Chrome console (via the Inspect option).

To learn more about xpath, you can refer to: https://www.w3schools.com/xml/xpath_examples.asp
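To see how such an xpath behaves without fetching the live page, here is a minimal sketch against a static snippet; the markup is a hypothetical simplification of the page structure the xpath above assumes:

```python
from lxml import html

# Hypothetical static snippet mirroring the structure the xpath expects:
# the element with id "avatar-card", whose second div holds the reputation
snippet = '<div id="avatar-card"><div></div><div><div><div>3</div></div></div></div>'

tree = html.fromstring(snippet)
# Same shape of query as the answer above, run against the static snippet
title = tree.xpath('//*[@id="avatar-card"]/div[2]/div/div[1]/text()')
print(title)  # ['3']
```

Note that xpath always returns a list; use `title[0]` to get the text itself. Browser-generated positional xpaths like this one are brittle, so matching on a class or id attribute is usually more robust.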