How do I scrape text from paragraphs with different ID names?

Asked: 2018-01-22 05:16:31

Tags: python web-scraping beautifulsoup scrapy

I am trying to scrape text from paragraphs that have different ID names. The markup looks like this:

<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>I realized how much Abilify has been helping me when I recently 
tried to taper off of it. I am on the bipolar spectrum, with mainly 
depression and some OCD symptoms. My obsessive, intrusive thoughts came 
racing back when I decreased the medication. I also got much more tired and 
had insomnia with the decrease. am not happy with side effects of 15 lb 
weight gain, increased cholesterol and a flat effect on my emotions. I am 
actually wondering if an increase from the 7 mg would help even more...for 
now I&#39;m living with the side effects.<br><a 
onclick="toggle('comTrunc1'); toggle('comFull1');return false;" 
href="#">Hide Full Comment</a></p>

<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>It&#39;s worked Very well for me. I&#39;m sleeping I&#39;m 
eating I&#39;m going Out in the public. Overall I&#39;m very 
satisfied.However I haven&#39;t heard anybody mention this but my feet are 
very puffy and swollen is this a side effect does anyone know?<br><a 
onclick="toggle('comTrunc2'); toggle('comFull2');return false;" 
href="#">Hide Full Comment</a></p>

......

I can scrape the text from one specific ID, but not from all of the IDs at once. Can anyone help me scrape the text from every ID? My code looks like this:

>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required2 = soup.find("p", {"id": "comFull1"}).text
>>> required2
"Comment:I realized how much Abilify has been helping me when I recently 
tried to taper off of it. I am on the bipolar spectrum, with mainly 
depression and some OCD symptoms. My obsessive, intrusive thoughts came 
racing back when I decreased the medication. I also got much more tired and 
had insomnia with the decrease. am not happy with side effects of 15 lb 
weight gain, increased cholesterol and a flat effect on my emotions. I am 
actually wondering if an increase from the 7 mg would help even more...for 
now I'm living with the side effects.Hide Full Comment"

3 answers:

Answer 0 (score: 1)

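As I understand it, the question is how to scrape the text of all the paragraphs, that is, the <p> tags, in a web page.

The function you are looking for is: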

soup.find_all('p')

A more comprehensive example is shown in the documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
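
As a minimal sketch of the "all paragraphs" approach, the snippet below runs find_all on a shortened, hypothetical version of the question's markup. The comTrunc1 and footer paragraphs and the comment texts are assumptions made for illustration. Because find_all("p") returns every paragraph on the page, filtering by the class attribute is one way to narrow the result.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the question's page.
html = """
<p id="comTrunc1" class="comment">Short preview...</p>
<p id="comFull1" class="comment" style="display:none"><strong>Comment:</strong><br>First full comment.</p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:</strong><br>Second full comment.</p>
<p class="footer">Unrelated paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <p> tag in the document, whatever its id or class.
all_paragraphs = soup.find_all("p")

# Only the paragraphs that carry class="comment".
comment_paragraphs = soup.find_all("p", class_="comment")

# get_text() concatenates all text nodes inside each paragraph.
texts = [p.get_text() for p in comment_paragraphs]

for text in texts:
    print(text)
```

Note that the class filter still matches the truncated preview paragraph, so it may not be specific enough when only the comFull paragraphs are wanted.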

Answer 1 (score: 1)

If you want to use XPath, you can use:

response.xpath("//p[contains(@id,'comFull')]/text()").extract()

However, since you are using Beautiful Soup, you can pass a function or a regular expression to the find_all method, as described here: Matching id's in BeautifulSoup

import re
soup.find_all('p', id=re.compile(r'^comFull\d+'))
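
A small sketch of the regular-expression approach follows, run against a shortened, hypothetical copy of the question's markup. The comment texts and the comTrunc1 paragraph are placeholders. The IDs in the question have no hyphen (comFull1, comFull2), so the pattern assumed here is comFull followed by digits. Removing the <strong> label and the toggle link before reading the text is an optional extra step, not something the answer itself prescribes.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the question's page.
html = """
<p id="comTrunc1" class="comment">First full... <a href="#">Read More</a></p>
<p id="comFull1" class="comment" style="display:none"><strong>Comment:</strong><br>First full comment.<br><a onclick="toggle('comTrunc1'); toggle('comFull1');return false;" href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:</strong><br>Second full comment.<br><a onclick="toggle('comTrunc2'); toggle('comFull2');return false;" href="#">Hide Full Comment</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Match ids such as comFull1, comFull2, ... but not comTrunc1.
pattern = re.compile(r"^comFull\d+$")

comments = []
for p in soup.find_all("p", id=pattern):
    # Remove the "Comment:" label and the "Hide Full Comment" link
    # so that only the comment body remains.
    for extra in p.find_all(["strong", "a"]):
        extra.decompose()
    comments.append(p.get_text(" ", strip=True))

print(comments)
```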

Answer 2 (score: 1)

Try this. If the IDs of all the paragraphs end in a numeric suffix such as 1, 2, 3 and so on, as in comFull1, comFull2, comFull3, then the selector below should handle them.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

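# Placeholder URL, as in the question; replace it with the real page address.
url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()
soup = BeautifulSoup(content, "html.parser")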
for item in soup.select("[id^='comFull']"):
    print(item.text)
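
To try the selector without fetching a page, here is a minimal offline sketch that uses a shortened, hypothetical copy of the question's markup (the comment texts are placeholders). It restricts the attribute selector to <p> tags and stores each paragraph's text under its id, which is one possible way to keep the comments apart.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the question's page.
html = """
<p id="comTrunc1" class="comment">First full...</p>
<p id="comFull1" class="comment" style="display:none"><strong>Comment:</strong><br>First full comment.<br><a href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:</strong><br>Second full comment.<br><a href="#">Hide Full Comment</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# [id^='comFull'] matches any id attribute that starts with "comFull".
by_id = {}
for item in soup.select("p[id^='comFull']"):
    # Join the text nodes with a space and trim surrounding whitespace.
    by_id[item["id"]] = item.get_text(" ", strip=True)

for comment_id, text in by_id.items():
    print(comment_id, "->", text)
```

The extracted text still contains the "Comment:" label and the "Hide Full Comment" link text, as in the output shown in the question.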