Scraping article text with BeautifulSoup: using ids

Time: 2018-01-07 10:30:28

Tags: python html web-scraping beautifulsoup

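I wrote a Python script to extract specific paragraphs, but I end up getting all the text on the page. I want to scrape the paragraphs from different pages, where they sit inside a tag such as: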

<div id="content-body-123123">

The id differs from page to page. How can I identify this particular tag and extract only the paragraphs inside it?

import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
html = page.content
soup = bs(html, 'html.parser')
# Prints every <p> on the page, not only those inside the content-body div
for tag in soup.find_all('p'):
    print(tag.text + '\n')

1 answer:

Answer 0: (score: 0)

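Try this. A change in the id number should not affect your results, because the CSS selector [id^='content-body-'] matches any element whose id starts with that prefix: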

from bs4 import BeautifulSoup
import requests

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
for content in soup.select("[id^='content-body-'] p"):
    print(content.text)
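
To see the prefix selector work without fetching the live page, here is a minimal sketch that runs against an inline HTML string. The sample markup, the `SAMPLE_HTML` name, and both helper functions are hypothetical and only for illustration; the real page's structure may differ. The second helper shows an alternative that matches the id with a regular expression instead of a CSS selector.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real article page.
SAMPLE_HTML = """
<div id="header"><p>Site navigation</p></div>
<div id="content-body-123123">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div id="footer"><p>Copyright notice</p></div>
"""


def paragraphs_by_selector(html):
    """Return the text of <p> tags under any element whose id starts with 'content-body-'."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select('[id^="content-body-"] p')]


def paragraphs_by_regex(html):
    """Same result, but locating the container with a compiled regex on the id attribute."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=re.compile(r"^content-body-"))
    if container is None:
        return []  # no matching container on this page
    return [p.get_text(strip=True) for p in container.find_all("p")]


paragraphs = paragraphs_by_selector(SAMPLE_HTML)
print(paragraphs)
```

The header and footer paragraphs are skipped in both versions because their parent ids do not start with content-body-. The regex variant only looks at the first matching container, so the selector version is the safer choice if a page could contain more than one.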