Question

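My goal is to fetch an XML file, extract all instances of a particular element, strip the XML tags, and then work with the remaining text.

I started with this, which strips the XML tags, but only from the XML file as a whole: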

from urllib import urlopen
import re

url = [URL of XML FILE HERE]  #the url of the file to search

raw = urlopen(url).read()   #open the file and read it into variable

exp = re.compile(r'<.*?>')
text_only = exp.sub('',raw).strip()

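I also have this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).

But I can't figure out how to apply the regex to the list produced by soup.find_all. I have tried text_only = [item for item in text2 if exp.sub('',item).strip()] and variations, but I keep getting this error: TypeError: expected string or buffer

What am I doing wrong?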

Answer 1

You don't want a regular expression here. Instead, just use BeautifulSoup's existing support for grabbing text:

quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
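
As for the TypeError: find_all returns Tag objects rather than strings, so re.sub rejects them, and the `if` in the comprehension only filters the list instead of transforming each item. Below is a minimal end-to-end sketch of the get_text() approach. The sample_xml string is a made-up stand-in for the downloaded file, and the built-in "html.parser" backend is an assumption chosen to avoid extra dependencies; the original setup may use a different parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for the downloaded XML file.
sample_xml = """
<doc>
  <section>ignored text</section>
  <quoted-block>First <i>quoted</i> passage.</quoted-block>
  <quoted-block>  Second passage.  </quoted-block>
</doc>
"""

# "html.parser" ships with Python; it is an assumption, not the asker's setup.
soup = BeautifulSoup(sample_xml, "html.parser")
quoted_blocks = soup.find_all("quoted-block")

# get_text() concatenates all text inside each element, nested tags included.
text_chunks = [block.get_text().strip() for block in quoted_blocks]

# If the regex is still wanted, each Tag must be converted to a string first.
exp = re.compile(r"<.*?>")
regex_chunks = [exp.sub("", str(block)).strip() for block in quoted_blocks]

print(text_chunks)
```

Each entry in text_chunks is then an ordinary string, so any further text processing can be applied to it directly.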
