Question

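I'm extracting information from an RSS feed. Because of further analysis I plan to do, I'd rather not use Beautiful Soup or feedparser; the reasons are outside the scope of this question.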

The output is coming out wrapped in [' and ']. For example:

Title:
['The Morning Download: Apple Stumbles but Mobile Soars']
Published:
['Tue, 28 Jan 2014 13:09:04 GMT']

Why does the output look like this, and how do I stop it?

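import re
import urllib2

# Assumption: `opener` was not defined in the snippet as posted;
# a default urllib2 opener is used here.
opener = urllib2.build_opener()

try:
    # This is the RSS feed that is being scraped
    page = 'http://finance.yahoo.com/rss/headline?s=aapl'

    yahooFeed = opener.open(page).read()

    try:
        items = re.findall(r'<item>(.*?)</item>', yahooFeed)

        for item in items:
            # Prints the title
            title = re.findall(r'<title>(.*?)</title>', item)
            print "Title:"
            print title

            # Prints the date / time published
            print "Published:"
            datetime = re.findall(r'<pubDate>(.*?)</pubDate>', item)
            print datetime

            print "\n"

    except Exception, e:
        print str(e)

except Exception, e:
    # The outer try had no except clause in the snippet as posted
    print str(e)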

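I'd appreciate any criticism, suggestions, and best-practice advice.

I'm a Java / Perl programmer still getting used to Python, so pointers to any good resources you know of would be much appreciated.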

Answer 1

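Use re.search instead of re.findall. re.findall always returns a list of all matches, and printing a list shows the brackets and quotes around each string.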

datetime = re.search(r'<pubDate>(.*?)</pubDate>', item).group(1)

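Note that the difference between re.findall and re.search is that the former returns a list (Python's array-like data structure) of all matches, while re.search returns only the first match found, as a match object.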
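To make the difference concrete, here is a minimal sketch. The `item` string is a made-up sample modelled on the output in the question, not real feed data, and the example is written for Python 3 (hence `print(...)` as a function).

```python
import re

# Hypothetical sample item, modelled on the output shown in the question.
item = (
    "<item><title>The Morning Download: Apple Stumbles but Mobile Soars</title>"
    "<pubDate>Tue, 28 Jan 2014 13:09:04 GMT</pubDate></item>"
)

# findall returns a list of captured strings, even when there is only one match.
titles = re.findall(r'<title>(.*?)</title>', item)

# search returns a match object for the first match; group(1) is the captured string.
title = re.search(r'<title>(.*?)</title>', item).group(1)

print(titles)  # a list, so it is shown with [' and ']
print(title)   # a plain string, so no brackets or quotes
```

If you want to keep re.findall for some reason, indexing the result (`titles[0]`) also gives the bare string, but it raises IndexError when nothing matched.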

If there is no match, re.search returns None, so handle that case as well:

match = re.search(r'<pubDate>(.*?)</pubDate>', item)
if match is not None:
    datetime = match.group(1)
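Since the same check is needed for every tag, it may be convenient to wrap it in a small helper. The following is only a sketch: `first_group` and its `default` parameter are hypothetical names introduced here, and the `item` string is made up to show the no-match case.

```python
import re

def first_group(pattern, text, default=None):
    """Return group 1 of the first match of pattern in text, or default if there is none."""
    match = re.search(pattern, text)
    return match.group(1) if match is not None else default

# Hypothetical item that has a title but no pubDate element.
item = "<item><title>Example headline</title></item>"

title = first_group(r'<title>(.*?)</title>', item)
published = first_group(r'<pubDate>(.*?)</pubDate>', item, default="(no date)")
```

Here `title` is the plain string from the tag and `published` falls back to the default instead of raising AttributeError on None.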
