I am using the following code to scrape the other tags:
for content in soup.find_all():
    try:
        link = content.find('enclosure')
        link = link.get('url')
        print("\n\nLink: ", link)
        title = content.find('title')
        # <item><guid isPermaLink="false"> == is causing doubling of first episode
        # title = content.find('title')
        title = title.get_text()
    except AttributeError:
        # find() returned None: this tag has no <enclosure> or <title> inside it
        continue
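It grabs the URLs fine and also grabs the correct titles shown below, but, as written, it obviously picks up the first two titles as well. How can I ignore those and start from the episode titles (Episode 116)?
(The site I am scraping is http://feeds.thisiscriminal.com/CriminalShow)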
<channel>
<title>Criminal</title>
<link>http://thisiscriminal.com/</link>
</description>
<image>
<url>https://f.prxu.org/criminal/images/....png</url>
<title>Criminal</title>
<link>http://thisiscriminal.com/</link>
<title>Episode 116</title>
<link>http://feeds.thisiscriminal.com/~r/...</link>
<description>
Any input would be greatly appreciated!
Answer 0 (score: 2)
Is something like the following what you are after?
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://feeds.thisiscriminal.com/CriminalShow')
soup = bs(r.content, 'lxml')

# Only look inside <item> elements, so channel- and image-level titles are skipped
for item in soup.select('item'):
    print(item.select_one('title').text)
    print([i.get('href', i.text) for i in item.select('[href], link') if i.get('href', i.text) != ''])
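If you want to check the idea without hitting the network, here is a minimal sketch of the same approach (restrict the search to `<item>` elements). Note that it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, and that `SAMPLE_FEED`, the `example.com` enclosure URLs, the "Episode 115" entry, and the helper name `episode_titles_and_urls` are all made up for illustration; they are not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down feed modelled on the snippet in the question.
# The enclosure URLs and the second episode are placeholders, not real data.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Criminal</title>
    <link>http://thisiscriminal.com/</link>
    <image>
      <title>Criminal</title>
      <link>http://thisiscriminal.com/</link>
    </image>
    <item>
      <title>Episode 116</title>
      <enclosure url="http://example.com/episode-116.mp3" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 115</title>
      <enclosure url="http://example.com/episode-115.mp3" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""


def episode_titles_and_urls(feed_xml):
    """Return (title, enclosure URL) pairs for <item> elements only."""
    root = ET.fromstring(feed_xml)
    episodes = []
    # iter("item") visits only <item> elements, so the channel-level and
    # image-level <title> tags are never considered.
    for item in root.iter("item"):
        title = item.findtext("title")
        enclosure = item.find("enclosure")
        url = enclosure.get("url") if enclosure is not None else None
        episodes.append((title, url))
    return episodes


episodes = episode_titles_and_urls(SAMPLE_FEED)
for title, url in episodes:
    print(title, url)
```

The two "Criminal" titles never show up because they sit directly under `<channel>` and `<image>`, not inside an `<item>`. The loop in the question iterates over every tag in the document, including `<channel>` itself, whose first `<title>` descendant is the show title.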