Question

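I wrote a Python script to extract specific paragraphs, but it ends up returning all the text on the page. I want to scrape the paragraphs from different pages that sit inside a tag like this one: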

<div id="content-body-123123">

The ID is different on every page. How can I identify this particular tag and extract only the paragraphs inside it?

import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
html = page.content
soup = bs(html, 'html.parser')
# find_all('p') matches every <p> on the page, not only those in the article body
for tag in soup.find_all('p'):
    print(tag.text + '\n')

Answer 1

Try this. The changing ID number should not affect your results:

from bs4 import BeautifulSoup
import requests

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
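# [id^='content-body-'] matches any element whose id starts with "content-body-",
# so the numeric suffix can differ from page to page
for content in soup.select("[id^='content-body-'] p"):
    print(content.text)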
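
To see why the prefix selector is independent of the numeric suffix, here is a minimal offline sketch. The HTML strings in PAGES and the helper names article_paragraphs and article_paragraphs_regex are made up for illustration and are not taken from the real site. It uses the built-in 'html.parser' so that lxml is not required, and it also shows a regex-based find() as an alternative to the CSS selector.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for two article pages with different ids.
PAGES = [
    '<div id="content-body-123123"><p>First article.</p></div><p>Footer</p>',
    '<div id="content-body-456"><p>Second article.</p><p>More text.</p></div>'
    '<div id="sidebar"><p>Ad</p></div>',
]

def article_paragraphs(html):
    """Text of every <p> inside an element whose id starts with 'content-body-'."""
    soup = BeautifulSoup(html, 'html.parser')
    return [p.get_text(strip=True) for p in soup.select('[id^="content-body-"] p')]

def article_paragraphs_regex(html):
    """Same idea, but locating the first matching <div> with a compiled regex."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', id=re.compile(r'^content-body-'))
    return [p.get_text(strip=True) for p in body.find_all('p')] if body else []

for html in PAGES:
    print(article_paragraphs(html))
```

Paragraphs outside the matched container (the footer and sidebar in this made-up markup) are skipped by both helpers. The regex variant only looks at the first matching div, which is enough when a page has a single article body.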

Article scraping with BeautifulSoup: using ids
