BeautifulSoup returns a closing tag instead of the tag's text

Time: 2017-06-14 03:03:36

Tags: python xml beautifulsoup lxml

I have the following RSS feed (SoundCloud):

<item>
      <pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
      <link>https://example.com</link>
</item>

I tried to get the link tag's contents with the following:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response, "lxml")

items = soup.findAll("item")
for i in items:
    print(i)
    created_at = i.find('pubdate')
    created_at = created_at.contents[0][:16]

    url = i.find('link')

This prints:

    <link/>

If I try url = i.find('link').string or url = i.find('link').content

I get nothing back.

When I print "i", the item prints an empty, self-closed tag for the link, followed by the rest of the item's text:

http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss       00:02:23 Daptone Records no Sharon Jones &amp; the Dap-Kings&#39; debut holiday album is available now!

How can I get the link's text correctly?

1 answer:

Answer 0: (score: 1)

You can do something like this, and it does the job:

from bs4 import BeautifulSoup as bs 
from urllib.request import urlopen

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()

parsed = bs(data, 'xml')
items = parsed.findAll('item')

for k in items:
    # Here is how you can access the tags inside the item tag
    print("Link:", k.link.text)
    print("pubDate:", k.pubDate.text)

Edit: using lxml

When I tried to parse the <link>...</link> tags with BeautifulSoup and the lxml (HTML) parser, the markup came out invalid: each link is emitted as an empty, self-closed </link> tag with the URL left as plain text after it, so BeautifulSoup cannot read the tag's data.
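This is easy to reproduce on a minimal snippet (a made-up one-item document standing in for the real feed): the lxml HTML parser treats <link> as a void HTML element, so the tag is closed immediately and its .string is None, which is exactly why the question's i.find('link').string printed nothing.

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one RSS item (assumed example, not the real feed).
snippet = "<item><link>https://example.com</link></item>"

tag = BeautifulSoup(snippet, "lxml").find("link")
print(tag)         # <link/> -- the HTML parser closes the void tag at once
print(tag.string)  # None    -- the URL is no longer inside the tag
```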

So a simple hack is to use a regex; here is an example:

from bs4 import BeautifulSoup as bs 
from urllib.request import urlopen
import re

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()

soup = bs(data, 'lxml')
aa = soup.findAll('item')

for k in aa:
    link = re.findall(r'<link/>(.*?)\s+', str(k))
    pubdate = k.find('pubdate').string
    print("Link: {}\npubdate: {}".format(' '.join(link), pubdate))

Both approaches output:

Link: https://soundcloud.com/daptone-records/move-upstairs
pubDate: Tue, 21 Mar 2017 20:30:49 +0000
...
Link: https://soundcloud.com/daptone-records/the-frightnrs-id-rather-go-blind-1
pubDate: Sun, 28 Jun 2015 00:00:00 +0000
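As a side note beyond the original answer: because the lxml HTML parser leaves the URL as the text node immediately following the empty <link/> tag, next_sibling can recover it without a regex. A sketch against a made-up one-item feed (not the real SoundCloud data):

```python
from bs4 import BeautifulSoup

# Assumed minimal feed standing in for the real RSS data.
data = """<rss><channel><item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com</link>
</item></channel></rss>"""

soup = BeautifulSoup(data, "lxml")
for item in soup.find_all("item"):
    # The URL survives as the text node right after the empty <link/> tag.
    url = item.find("link").next_sibling.strip()
    print("Link:", url)
```

This keeps everything inside BeautifulSoup's own tree navigation; the cleaner fix, as the answer shows, is simply to use the "xml" parser, which does not apply HTML's void-element rules to <link>.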