BeautifulSoup: printing multiple tags/attributes

Time: 2011-05-16 17:10:58

Tags: python xml beautifulsoup xml-parsing

First off, this is my first attempt at using Python. So far it seems easy to use, but I have still run into a problem.

I am trying to convert an XML file into RSS XML. The original XML source looks like this:

<news title="Random Title" date="Date and Time" subtitle="The article txt"></news>

The end result should look like this:

<item>
<pubDate>Date and Time</pubDate>
<title>Random Title</title>
<content:encoded>The article txt</content:encoded>
</item>

I am trying to do this with Python and BeautifulSoup, using the following script:

from BeautifulSoup import BeautifulSoup
import re

doc = [
'<news post_title="Random Title" post_date="Date and Time" post_content="The article txt"></news>'
    ]
soup = BeautifulSoup(''.join(doc))

print soup.prettify()

posttitle = soup.news['post_title']
postdate = soup.news['post_date']
postcontent = soup.news['post_content']

print "<item>"
print "<pubDate>"
print postdate
print "</pubDate>"
print "<title>"
print posttitle
print "</title>"
print "<content:encoded>"
print postcontent
print "</content:encoded>"
print "</item>"

The problem is that it only retrieves the topmost <news> element in the XML, not the others. Can anyone give me some pointers on how to fix this?

Cheers :)

2 Answers:

Answer 0: (score: 0)

Your example doc variable contains only one <news> element.

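In general, though, you need to loop over the <news> elements. soup.news only ever returns the first matching element, whereas soup.findAll('news') returns all of them.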

Something like:

for news in soup.findAll('news'):
    posttitle = news['post_title']
    postdate = news['post_date']
    postcontent = news['post_content']
    print "<item>"
    print "<pubDate>"
    print postdate
    print "</pubDate>"
    print "<title>"
    print posttitle
    print "</title>"
    print "<content:encoded>"
    print postcontent
    print "</content:encoded>"
    print "</item>"

Answer 1: (score: 0)

Borrowing the code from the other answer and correcting it:

for news in soup.findAll('news'):
    posttitle = news['post_title']
    postdate = news['post_date']
    postcontent = news['post_content']
    print "<item>"
    print "<pubDate>"
    print postdate
    print "</pubDate>"
    print "<title>"
    print posttitle
    print "</title>"
    print "<content:encoded>"
    print postcontent
    print "</content:encoded>"
    print "</item>"
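
For readers on Python 3, the same loop can be sketched with only the standard library. This is not the BeautifulSoup approach from the answers: it swaps in xml.etree.ElementTree, and it assumes the <news> elements sit inside a single root element (the <feed> wrapper, the SOURCE sample data, and the news_to_items helper below are all hypothetical names made up for illustration). It also escapes the attribute values before writing them into the output, which the print-based versions above do not do.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical input: two <news> elements wrapped in an assumed <feed> root,
# because ElementTree requires a single root element.
SOURCE = """<feed>
<news post_title="Random Title" post_date="Date and Time" post_content="The article txt"></news>
<news post_title="Tom &amp; Jerry" post_date="Another Date" post_content="More txt"></news>
</feed>"""


def news_to_items(xml_text):
    """Return one RSS-style <item> string per <news> element."""
    root = ET.fromstring(xml_text)
    items = []
    # iter() walks every <news> element, not just the first one.
    for news in root.iter("news"):
        items.append(
            "<item>\n"
            "<pubDate>{}</pubDate>\n"
            "<title>{}</title>\n"
            "<content:encoded>{}</content:encoded>\n"
            "</item>".format(
                # escape() re-encodes &, < and > so the output stays valid XML.
                escape(news.get("post_date", "")),
                escape(news.get("post_title", "")),
                escape(news.get("post_content", "")),
            )
        )
    return items


if __name__ == "__main__":
    print("\n".join(news_to_items(SOURCE)))
```

Returning a list of strings rather than printing directly makes it easy to join the items into a larger RSS document later.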