I want to scrape information from paragraph tags.
The paragraph tags also contain some other tags, as the code below shows.
This is the HTML page to scrape:
<div class="thecontent">
<p>Here’s the schedule of matches for the weekend.</p>
<p> </p>
<p><strong>Saturday, August 17</strong></p>
<p>Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it</p>
<p>pritos vs. baola, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it</p>
<p>timpao vs. quadrsa, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it</p>
<p><strong>Sunday, August 18</strong></p>
<p>Achara vs. timpao, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it</p>
<p>pritos vs. qaudra, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it</p>
<p>timpao vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it</p>
<p> </p>
<p><strong>Monday, August 19</strong></p>
<p>Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it</p>
</p>
<p> </p></div></body></html>
I used the following Python code:
import bs4
import requests

getnwp = requests.get('https://url')
nwpcontent = getnwp.content
sp2 = bs4.BeautifulSoup(nwpcontent, 'html5lib')
pta = sp2.find('div', class_='thecontent').find_all('p')
for i in range(len(pta)):
    if pta[i].get_text().find("vs") != -1:
        print(pta[i].get_text())
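From the information above, I want to extract only the matches between teams and the dates they take place, along with the short message, as shown below:
Saturday, August 17
Achara vs. timpao, Have enjoy it and celebrate it
pritos vs. baola, Have enjoy it and celebrate it
timpao vs. Quadrsa, Have enjoy it and celebrate it
Sunday, August 18
Achara vs. timpao, Have enjoy it and celebrate it
pritos vs. qaudra, Have enjoy it and celebrate it
timpao vs. Buad, Have enjoy it and celebrate it
Monday, August 19
Achara vs. Buad, Have enjoy it and celebrate it
In other words, I don't want the information about the TV broadcasts (the text inside the anchor tags).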
Answer 0 (score: 1)
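It looks like the paragraphs holding that content also include the trailing note " — Have enjoy it and celebrate it", so it always comes along when you retrieve their text. What you can do is strip the tail of the string with something like:
text = pta[i].get_text()
if len(text) > 33:
    text = text[:-33]
This removes the last 33 characters of the resulting string.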
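A fixed 33-character slice depends on the exact length of the trailing note, and it drops the note rather than the broadcaster names. As an alternative sketch, assuming every match line has the shape "teams, broadcasters, dash, note" seen in the question's HTML, the text returned by `get_text()` can be split on the first comma and on the last dash instead. The helper name `clean_match_line` is hypothetical and not part of bs4.

```python
def clean_match_line(text, dash="\u2014"):
    """Return 'teams, note' from a line shaped like 'teams, tv, tv <dash> note'.

    Hypothetical helper: assumes the teams come before the first comma and
    the note follows the last dash (U+2014), as in the question's sample HTML.
    """
    teams = text.split(",", 1)[0].strip()
    # rpartition returns an empty separator when the dash is absent
    _, sep, note = text.rpartition(dash)
    if not sep:
        return teams
    return f"{teams}, {note.strip()}"


line = "Achara vs. Buad, ftv, HTlive, Se \u2014 Have enjoy it and celebrate it"
print(clean_match_line(line))
```

It is meant only for lines that contain "vs"; a date heading such as "Saturday, August 17" also contains a comma and should be printed unchanged.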
Answer 1 (score: 1)
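I don't know what the actual source looks like. As an example, assuming you can remove the a tags, you can then target the paragraphs with :has and :not(:empty). Requires bs4 4.7.1+.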
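from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://worldsoccertalk.com/2019/08/16/epl-commentator-assignments-nbc-sports-gameweek-2-3/')
soup = bs(r.content, 'lxml')

# Remove every anchor tag so the broadcaster names drop out of the text
for a in soup("a"):
    a.decompose()

# Select the date headings and the non-empty paragraphs that follow them.
# Newer soupsieve releases spell :contains as :-soup-contains.
for i in soup.select('.thecontent p:has(strong:not(:contains("SEE MORE"))), .thecontent p:has(strong:not(:contains("SEE MORE"))) ~ p:not(:empty)'):
    data = i.text.strip()
    if data:
        if ' vs. ' in data:
            items = data.split(',')
            # Keep the teams (first item) and the note after the dash
            print(', '.join([items[0], items[-1].split('—')[1].strip()]))
        else:
            print(data)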
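To try the idea without fetching the live page, here is a minimal offline sketch of the same approach: skip the text inside a tags and keep the rest of each paragraph. It swaps bs4 for the standard library's `html.parser`, so it does not use the selectors above and does not restrict itself to the `thecontent` div. The names `ParagraphText`, `schedule_lines` and `SAMPLE` are hypothetical, and `SAMPLE` is a shortened, cleaned-up copy of the question's HTML.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, skipping anything inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._parts = None    # text pieces of the open <p>, or None
        self._in_anchor = 0   # nesting depth of <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []
        elif tag == "a":
            self._in_anchor += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor -= 1
        elif tag == "p" and self._parts is not None:
            self.paragraphs.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Keep text only when inside a <p> and outside any <a>
        if self._parts is not None and not self._in_anchor:
            self._parts.append(data)


def schedule_lines(html, dash="\u2014"):
    """Return date headings and 'teams, note' lines from the given HTML."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    lines = []
    for text in parser.paragraphs:
        text = text.strip()
        if not text:
            continue  # skip spacer paragraphs
        if " vs. " in text:
            teams = text.split(",", 1)[0].strip()
            _, sep, note = text.rpartition(dash)
            lines.append(f"{teams}, {note.strip()}" if sep else teams)
        else:
            lines.append(text)
    return lines


SAMPLE = """
<div class="thecontent">
<p><strong>Saturday, August 17</strong></p>
<p>Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a> \u2014 Have enjoy it and celebrate it</p>
<p> </p>
</div>
"""

for line in schedule_lines(SAMPLE):
    print(line)
```

The real page may contain malformed attributes or extra paragraphs (such as the "SEE MORE" blocks filtered above), so treat this as a starting point rather than a drop-in replacement.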