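I wrote a Python script to extract specific paragraphs, but it ends up returning all the text on the page. I want to scrape the paragraphs from several different pages, where the article text sits inside a tag such as:
<div id="content-body-123123">
The numeric part of this ID differs from page to page. How can I identify this particular tag and extract only the paragraphs inside it?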
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
html = page.content
soup = bs(html, 'html.parser')
# This loop matches every <p> on the page, not only those inside the content div
for tag in soup.find_all('p'):
    print(tag.text + '\n')
Answer 0 (score: 0)
Try this. The selector matches on the fixed prefix of the id, so changes to the id number should not affect your results:
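from bs4 import BeautifulSoup
import requests

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')  # requires the lxml package to be installed
# [id^='content-body-'] matches any element whose id starts with that prefix;
# the trailing " p" restricts the result to <p> tags nested inside it
for content in soup.select("[id^='content-body-'] p"):
    print(content.text)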
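
If you would rather not use a CSS selector, the same prefix match can probably be done with a regular expression passed to find(). The following is only a sketch run against a small inline HTML string rather than the live page. SAMPLE_HTML, extract_paragraphs and id_prefix are made-up names for illustration, and it assumes the content container is a <div>, as in the question:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; the markup and id suffix are invented.
SAMPLE_HTML = """
<html><body>
  <p>Navigation text that should be skipped.</p>
  <div id="content-body-123123">
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </div>
  <div id="footer"><p>Footer text that should be skipped.</p></div>
</body></html>
"""


def extract_paragraphs(html, id_prefix="content-body-"):
    """Return the text of each <p> inside the first <div> whose id starts with id_prefix."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex as the id filter matches any id beginning with the prefix.
    container = soup.find("div", id=re.compile("^" + re.escape(id_prefix)))
    if container is None:
        # No matching container on this page, so there is nothing to extract.
        return []
    return [p.get_text(strip=True) for p in container.find_all("p")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
for text in paragraphs:
    print(text)
```

To use it on a real page, pass requests.get(url).content in place of SAMPLE_HTML. Returning an empty list when no container is found keeps a loop over many pages from crashing on one that uses a different layout.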