Question

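I'm using the lxml and requests modules and just trying to parse an article from a news site. Here is a link to a sample article: https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece

If you inspect the body of the article, you can see that it sits inside a div with the class "article". I'm trying to parse the article using that class, but I always come up blank. There are no errors or anything; the text just isn't found.

I also tried BeautifulSoup's find_all, but still got nothing.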

from lxml import html
import requests

page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)

article = tree.xpath('//div[@class="article"]/text()')

When I print article, I get the list ['\n', '\n', '\n', '\n', '\n'] instead of the body of the article. Where exactly am I going wrong?

Answer 1

I would use bs4 and pass the class name as a CSS selector to select_one:

import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)

If you use

article = tree.xpath('//div[@class="article"]//text()')

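you get a list that still contains all the \n entries alongside the actual text. The difference is that /text() returns only the text nodes that are direct children of the div, while //text() also returns the text nested inside its child elements. I think the leftover whitespace can be handled with re.sub or some conditional logic.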
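As a rough sketch of that cleanup step, the snippet below compares /text() with //text() and then filters the result using re.sub plus a simple condition. The inline SAMPLE markup and the clean helper are hypothetical, made up here so the example runs without a network request; the real page's structure may differ.

```python
import re
from lxml import html

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE = """
<html><body>
<div class="article">
  <p>First paragraph of   the story.</p>
  <p>Second <b>paragraph</b> here.</p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# /text() returns only text nodes that are direct children of the div,
# which here is just the whitespace between the <p> tags.
direct = tree.xpath('//div[@class="article"]/text()')

# //text() returns text nodes at any depth below the div.
nested = tree.xpath('//div[@class="article"]//text()')


def clean(chunks):
    """Collapse whitespace runs and drop chunks that end up empty."""
    collapsed = (re.sub(r"\s+", " ", chunk).strip() for chunk in chunks)
    return [chunk for chunk in collapsed if chunk]


print(clean(direct))            # nothing but whitespace survives here
print(" ".join(clean(nested)))  # the readable article text
```

Joining the cleaned chunks with a single space is just one choice; joining with newlines may suit multi-paragraph articles better.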
