Question

I am trying to print the content of a news article using BeautifulSoup4.

The URL is: Link

My current code is shown below, and it produces the desired output:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece')
soup = BeautifulSoup(page.content, 'html.parser')

article_text = ""
# The id is hard-coded for this one article
table = soup.find_all("div", {"id": "content-body-14266949-16447029"})

for element in table:
    # Join every text node inside the matched div
    article_text += ''.join(element.find_all(text=True)) + "\n\n"

print(article_text)

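However, the problem is that I want to scrape multiple pages, and each page has a different content-body number, in the format xxxxxxxx-xxxxxxxx (two blocks of 8 digits).

I tried replacing the soup.find_all call with a regular expression: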

table = soup.find_all(text=re.compile("content-body-........-........"))

But this raises an error:

AttributeError: 'NavigableString' object has no attribute 'find_all'

Can someone guide me on what to do?

Thanks.

Answer 1

You can use lxml to extract the content. The lxml library lets you pull content out of HTML with XPath.

from lxml import etree

# pageText is the HTML source of the page as a string, e.g. requests.get(url).text
selector = etree.HTML(pageText)
article_text = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].text

I don't use BeautifulSoup myself, but I think you could do something like this with it:

table = soup.find_all("div", {"class": "article-block-multiple live-snippet"})

Then use find on the result to get the first child div element.
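As a rough sketch of that last step in BeautifulSoup, the snippet below runs against a small inline HTML fragment instead of the live page. The sample markup is made up for illustration, and the class name is simply the one quoted in this answer; it has not been checked against the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class name is taken from the answer above.
SAMPLE_HTML = """
<div class="article-block-multiple live-snippet">
  <div>Body of the article.</div>
  <div>Related links.</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match the div whose class attribute is exactly this string
container = soup.find("div", class_="article-block-multiple live-snippet")

# recursive=False restricts the search to direct children, like /div[1] in the XPath
first_div = container.find("div", recursive=False) if container else None

article_text = first_div.get_text(strip=True) if first_div else ""
print(article_text)
```

Note that matching on the full class string depends on the classes appearing in exactly that order in the page source.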

Answer 2

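A regular expression should work fine here. It needs to be applied to the id attribute of the div rather than to the text. Try:

table = soup.find_all("div", {"id": re.compile(r'^content-body-')})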
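To show the regex filter working across pages with different ids, here is a minimal sketch that runs on an inline HTML fragment rather than the live site. The sample markup, the second id, and the helper name `extract_article_text` are invented for illustration; the stricter pattern assumes the two 8-digit blocks described in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for content blocks from different articles.
SAMPLE_HTML = """
<div id="content-body-14266949-16447029"><p>First article.</p></div>
<div id="content-body-12345678-87654321"><p>Second article.</p></div>
<div id="sidebar-1"><p>Not an article.</p></div>
"""

# Two blocks of 8 digits, as described in the question.
CONTENT_ID = re.compile(r"^content-body-\d{8}-\d{8}$")


def extract_article_text(html):
    """Return the text of every div whose id matches CONTENT_ID."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex passed as an attribute filter is matched against each tag's id
    blocks = soup.find_all("div", id=CONTENT_ID)
    return [block.get_text(strip=True) for block in blocks]


articles = extract_article_text(SAMPLE_HTML)
print(articles)
```

Because find_all returns Tag objects here (not NavigableString objects), each result still supports find_all and get_text, which avoids the AttributeError from the question.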

Answer 3

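Another approach is to use CSS selectors. Selectors are clean and to the point, so you can give this a try as well. Just replace "url" with the link you are working with.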

import requests
from bs4 import BeautifulSoup

res = requests.get(url).text
soup = BeautifulSoup(res,"html.parser")

for item in soup.select("div[id^=content-body-] p"):
    print(item.text)
