I have the following RSS feed (SoundCloud):
<item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com</link>
</item>
I tried to get the content of the link tag with the following:
soup = BeautifulSoup(response, "lxml")
items = soup.findAll("item")
for i in items:
    print i
    created_at = i.find('pubdate')
    created_at = created_at.contents[0][:16]
    url = i.find('link')
This prints:
<link/>
If I try url = i.find('link').string
or url = i.find('link').content
I get
None
When I print "i", the item prints a closing tag for the link first:
http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss 0:02:23 Daptone Records None Sharon Jones & the Dap-Kings' first holiday album is out now!
How can I get the link correctly?
Answer 0 (score: 1)
You can do something like this, and it works well:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()
parsed = bs(data, 'xml')
items = parsed.findAll('item')
for k in items:
    # Here is how you can access the tags inside the item tag
    print("Link:", k.link.text)
    print("pubDate:", k.pubDate.text)
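If you would rather not depend on BeautifulSoup at all, the same extraction can be done with the standard library's ElementTree, since RSS is plain XML. This is a minimal sketch using an inline snippet with the same structure as the feed (the snippet itself is an assumption, not the real feed data):

```python
import xml.etree.ElementTree as ET

# A small stand-in for the real SoundCloud feed, assuming the same structure
rss = """<rss><channel><item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com</link>
</item></channel></rss>"""

root = ET.fromstring(rss)
for item in root.iter('item'):
    # findtext returns the text content of the named child tag
    print("Link:", item.findtext('link'))
    print("pubDate:", item.findtext('pubDate'))
```

For a remote feed you would pass `urlopen(url).read()` to `ET.fromstring` instead of the inline string.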
Edit: using lxml
When I tried to parse the <link>...</link> tags with BeautifulSoup and lxml, the markup came out invalid: lxml treats <link> as a self-closing HTML tag, so each link is emitted as an empty <link/> and BeautifulSoup cannot parse its data.
So a simple hack is to use a regex; here is an example:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()
soup = bs(data, 'lxml')
items = soup.findAll('item')
for k in items:
    # The URL text follows the empty <link/> tag, so pull it out with a regex
    link = re.findall(r'<link/>(.*?)\s+', str(k))
    pubdate = k.find('pubdate').string
    print("Link: {}\npubDate: {}".format(' '.join(link), pubdate))
Both methods will output:
Link: https://soundcloud.com/daptone-records/move-upstairs
pubDate: Tue, 21 Mar 2017 20:30:49 +0000
...
Link: https://soundcloud.com/daptone-records/the-frightnrs-id-rather-go-blind-1
pubDate: Sun, 28 Jun 2015 00:00:00 +0000
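Since the question also slices the first 16 characters of pubDate, it may be cleaner to parse the RFC 822 date that RSS uses into a real datetime with the standard library. A minimal sketch, using one of the dates from the output above:

```python
from email.utils import parsedate_to_datetime

# RSS <pubDate> values are RFC 822 dates, which email.utils can parse
pubdate = "Tue, 21 Mar 2017 20:30:49 +0000"
dt = parsedate_to_datetime(pubdate)
print(dt.isoformat())  # 2017-03-21T20:30:49+00:00
```

From the datetime object you can then format or compare dates instead of slicing strings.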