Question

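Sorry, I couldn't come up with the right words for the title. What I want is for this code to give me all the text that meets my requirements. The problem is that markup such as "<p>", "<a href ....>", "<h1>", "<h2>" .... is printed along with the text. Can anyone help me skip these tags? My code (I am using Python 2.7.8):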

import urllib
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it
u = urllib.urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

# Extract and output tags of interest
for item in doc.iterfind('channel/item'):
#    title = item.findtext('title')
#    date = item.findtext('pubDate')
#    link = item.findtext('link')
    des = item.findtext('description')
#    print(title)
#    print(date)
#   print(link)
    print(des)
    print('')  # blank line between items; a bare print() prints "()" in Python 2

Answer 1

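Try using BeautifulSoup to parse the HTML content. If you only need the text, something like this will work. If you need specific information from the HTML content, you can parse the HTML further.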

import urllib
from xml.etree.ElementTree import parse
from bs4 import BeautifulSoup as bs

# Download the RSS feed and parse it
u = urllib.urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

# Extract and output tags of interest
for item in doc.iterfind('channel/item'):
    des = item.findtext('description')
    if des:
        soup = bs(des, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
        text = soup.get_text()
        print(text.encode('utf-8'))
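
If installing BeautifulSoup is not an option, roughly the same tag stripping can be sketched with the standard library's HTMLParser instead. This is a different technique from the one in the answer, not a drop-in equivalent: the names TextExtractor and strip_tags and the sample string are made up for illustration, and on Python 2 this sketch drops entities such as &amp; rather than decoding them.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are never stored.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


# Made-up sample resembling an RSS <description> body.
sample = ('<h2>Release notes</h2>'
          '<p>See the <a href="http://example.com">changelog</a> for details.</p>')
print(strip_tags(sample))
```

In the loop above, strip_tags(des) would take the place of the bs(...).get_text() call. Note that text from adjacent block elements is joined without a separator, so a heading runs straight into the paragraph that follows it.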

How to skip <p> </p> <h2> <a ......> </a ......> </h2> when fetching data
