Question

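I'm trying to parse a news site's RSS feed and extract the publication date, title, description, and link to the actual article. So far I have these lines of code: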

import bs4

with open('text.txt', 'r', encoding='utf-8') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')
    all_item_tags = soup.find_all('item')
    first = all_item_tags[0]
    second = all_item_tags[1]
    print(first.contents[9].contents[0], first.contents[1].contents[0], first.contents[4], first.contents[5].contents[0])
    print(second.contents[9].contents[0], second.contents[1].contents[0], second.contents[4], second.contents[5].contents[0])

This gets me the information, but I can't figure out how to loop over every index of all_item_tags and grab .contents[].contents[] for each one without writing first, second, third, and so on.

Edit: contents of text.txt - http://www.dailymail.co.uk/home/index.rss

Answer 1

From the comments section:

How about for item_tag in all_item_tags? - t.m.adam
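The comment's suggestion can be sketched roughly as follows. This is only an illustration under a few assumptions: SAMPLE_RSS is a made-up two-item feed standing in for text.txt, the helper names field and extract_items are hypothetical, and the 'xml' parser (which requires lxml) is used in place of the 'lxml' HTML parser from the question. The HTML parser lowercases tag names and treats link as an empty element, which is likely why the link text showed up at contents[4] rather than inside the link tag.

```python
import bs4

# Hypothetical two-item feed standing in for the contents of text.txt.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
      <description>First summary</description>
      <pubDate>Mon, 01 Jan 2018 10:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
      <description>Second summary</description>
      <pubDate>Tue, 02 Jan 2018 11:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def field(item_tag, name):
    """Return the stripped text of the named child tag, or None if absent."""
    child = item_tag.find(name)
    return child.get_text(strip=True) if child is not None else None


def extract_items(xml_text):
    """Collect (date, title, description, link) for every <item> tag."""
    # Assumption: the 'xml' parser keeps tag names case-sensitive (pubDate)
    # and keeps the URL inside <link>.
    soup = bs4.BeautifulSoup(xml_text, 'xml')
    rows = []
    # Iterate over the tags directly instead of indexing first, second, ...
    for item_tag in soup.find_all('item'):
        rows.append((
            field(item_tag, 'pubDate'),
            field(item_tag, 'title'),
            field(item_tag, 'description'),
            field(item_tag, 'link'),
        ))
    return rows


for row in extract_items(SAMPLE_RSS):
    print(*row)
```

Looking children up by name rather than by position in .contents should also keep the loop working when an item has extra or missing child tags.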

Answer 2

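The feed is already served as an RSS document. Rather than parsing it with BeautifulSoup, save yourself the trouble of scraping the page and use feedparser instead. It parses both RSS and Atom and does a lot of normalization for you. It was a lifesaver when I was building a news corpus while learning natural language processing.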

Install feedparser

pip install feedparser

Parse the feed

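import feedparser

# Entries live on the parse result itself, not on its .feed attribute
dailymail = feedparser.parse('http://www.dailymail.co.uk/home/index.rss')
for entry in dailymail.entries:
    # .get() falls back to the default when a field is missing
    title = entry.get('title', 'No title')
    description = entry.get('summary', 'No description.')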

Looping through bs4.element.Tag
