I am trying to scrape text from paragraphs that have different ID names. The markup looks like this:
<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.<br><a
onclick="toggle('comTrunc1'); toggle('comFull1');return false;"
href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>It's worked Very well for me. I'm sleeping I'm
eating I'm going Out in the public. Overall I'm very
satisfied.However I haven't heard anybody mention this but my feet are
very puffy and swollen is this a side effect does anyone know?<br><a
onclick="toggle('comTrunc2'); toggle('comFull2');return false;"
href="#">Hide Full Comment</a></p>
......
I can scrape the text from one specific ID, but not from all of the IDs at once. Can anyone help me scrape the text from every ID? My code looks like this:
>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required2 = soup.find("p", {"id": "comFull1"}).text
>>> required2
"Comment:I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.Hide Full Comment"
Answer 0 (score: 1)
As I understand it, the problem is to scrape the text of every paragraph, that is, every <p> tag, on the web page.
The function you are looking for is find_all:
soup.find_all("p")
A more comprehensive example is shown in the documentation.
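
As a rough illustration of this approach, here is a minimal sketch that collects the text of every <p> tag with find_all. The HTML string is a made-up stand-in for the real page (the URL in the question is elided), and the names html and all_paragraphs are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page; not the actual site markup.
html = """
<p id="comFull1" class="comment"><strong>Comment:</strong><br>First comment.</p>
<p id="comFull2" class="comment"><strong>Comment:</strong><br>Second comment.</p>
<p class="footer">Unrelated paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all("p") returns every <p> tag; get_text() joins the text inside each one.
all_paragraphs = [p.get_text() for p in soup.find_all("p")]
print(all_paragraphs)
```

Note that this grabs every paragraph on the page, including ones that are not comments, so it may need an extra filter on the id if the page has other <p> tags.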
Answer 1 (score: 1)
If you want to use XPath (for example on a Scrapy response), you can use:
response.xpath("//p[contains(@id,'comFull')]/text()").extract()
However, since you are using Beautiful Soup, you can pass a function or a regular expression to the find_all method, as mentioned here:
Matching id's in BeautifulSoup
import re
soup.find_all('p', id=re.compile('^comFull'))
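
To make the two options concrete, here is a small sketch that matches the ids both with a regular expression and with a function. The sample HTML and the helper name is_full_comment_id are assumptions for illustration; the pattern also assumes the ids are "comFull" followed by digits, as in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample; the comTrunc1 paragraph is there to show it is skipped.
html = """
<p id="comTrunc1" class="comment">Short version...</p>
<p id="comFull1" class="comment"><strong>Comment:</strong><br>First comment.</p>
<p id="comFull2" class="comment"><strong>Comment:</strong><br>Second comment.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Regex form: ids that are "comFull" followed by one or more digits.
by_regex = soup.find_all("p", id=re.compile(r"^comFull\d+$"))

# Function form: Beautiful Soup calls this with each tag's id value
# (None when the tag has no id attribute).
def is_full_comment_id(value):
    return value is not None and value.startswith("comFull")

by_function = soup.find_all("p", id=is_full_comment_id)

regex_ids = [p["id"] for p in by_regex]
function_ids = [p["id"] for p in by_function]
print(regex_ids, function_ids)
```

Either form returns the same tags here; the function form is handy when the condition is awkward to express as a pattern.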
Answer 2 (score: 1)
Try this. If the IDs of all the paragraphs you want end with a number (1, 2, 3, etc.), as in comFull1, comFull2, comFull3, then the selector below should handle it.
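from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# content is the page HTML, fetched with urlopen(req).read() as in the question
soup = BeautifulSoup(content, "html.parser")
# [id^='comFull'] matches every element whose id starts with "comFull"
for item in soup.select("[id^='comFull']"):
    print(item.text)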
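
The output in the question still contains the "Comment:" label and the "Hide Full Comment" link text. If only the comment body is wanted (an assumption, since the question does not say so), a sketch along these lines might work. The HTML is a shortened, made-up version of the markup in the question, and the comments dictionary is just an illustrative way to keep each text with its id.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup shown in the question.
html = """
<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>First comment body.<br><a href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>Second comment body.<br><a href="#">Hide Full Comment</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

comments = {}
for item in soup.select("p[id^='comFull']"):
    # Drop the <strong> label and the toggle link so only the body text is left.
    for extra in item.find_all(["strong", "a"]):
        extra.decompose()
    comments[item["id"]] = item.get_text(strip=True)

print(comments)
```

Be aware that decompose() modifies the parsed tree in place, so re-parse the page if the original tags are needed afterwards.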