Question

I have a manually entered input file made up of citations, each in the following format:

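<S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S><S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.</S>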

This is my current approach in Python:

citance = citance[citance.find(">")+1:citance.rfind("<")]
fd.write(citance+"\n")

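I am trying to extract everything from the first occurrence of a closing angle bracket (">") to the last occurrence of an opening angle bracket ("<"). However, this approach fails when there are multiple citations, because the intermediate tags also end up in the output:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S><S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.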

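My desired output:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier. Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.

How can I implement this correctly?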

Answer 1

I would use Python's regex module, re, like this:

re.findall(r'\">(.*?)<', text_to_parse)

This returns a list containing one or more citations; if you want a single piece of text, you can join them afterwards (" ".join(....)).
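Putting the two steps together on input shaped like the question's, here is a minimal sketch. The sentences are shortened for readability, and the names `sentences` and `joined` are only illustrative.

```python
import re

# Shortened sample input in the same tag format as the question.
text_to_parse = (
    '<S sid ="2" ssid = "2">It differs from previous machine learning-based NERs.</S>'
    '<S sid ="3" ssid = "3">Previous work often uses a secondary classifier.</S>'
)

# Non-greedy capture of everything between '">' (the end of an opening tag)
# and the next '<' (the start of the closing tag).
sentences = re.findall(r'">(.*?)<', text_to_parse)

# Join the captured sentences into one string, separated by spaces.
joined = " ".join(sentences)
print(joined)
```

Note that this pattern relies on every opening tag ending in `">`, and that `.` does not match newlines unless `re.DOTALL` is passed, so a citation spanning several lines would need that flag.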

Answer 2

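Instead of using the re module, take a look at the bs4 library.

It is an XML/HTML parser, so you can get everything between the tags.

In your case, it would look something like this: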

from bs4 import BeautifulSoup

xml_text = '<S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S><S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.</S>'

text_soup = BeautifulSoup(xml_text, 'lxml')

# HTML parsers lowercase tag names, so search for 's' rather than 'S'.
output = text_soup.find_all('s', attrs={'sid': '2'})

# find_all returns a list of tags; get_text() extracts the text of a tag.
print(output[0].get_text())

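The matched element contains the text:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.

Also, if you only want to strip the HTML tags: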

import re

xml_text = '<S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S><S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence-based classifier.</S>'

# Replace every tag (the shortest run from '<' to '>') with an empty string.
re.sub(r'<.*?>', '', xml_text)

will do the job.
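If installing bs4 is not an option, the standard library's html.parser module can collect the text between tags in a similar way. This is a swapped-in alternative rather than what the answer proposes, and the class name `TextCollector` is hypothetical. The sample sentences are shortened.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


xml_text = (
    '<S sid ="2" ssid = "2">It differs from previous machine learning-based NERs.</S>'
    '<S sid ="3" ssid = "3">Previous work often uses a secondary classifier.</S>'
)

collector = TextCollector()
collector.feed(xml_text)
collector.close()

# One chunk per sentence, joined with a space so the sentences do not run together.
plain_text = " ".join(collector.chunks)
print(plain_text)
```

Unlike stripping tags with re.sub, which leaves the two sentences directly adjacent, collecting the chunks first makes it easy to choose the separator.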

Answer 3

I think this is what you are looking for.

import re

string = ">here is some text<>here is some more text<"
matches = re.findall(">(.*?)<", string)
for match in matches:
    print(match)

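It looks like your problem was that too much was being matched. With a greedy pattern, the match can run from the first character of the string to the last, since those are ">" and "<", ignoring the ones in the middle. The non-greedy ".*?" idiom makes each match as short as possible, so it finds the maximum number of hits.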
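The difference between the two quantifiers can be seen by running both patterns on the same string. This is a small sketch; the names `greedy` and `lazy` are only illustrative.

```python
import re

string = ">here is some text<>here is some more text<"

# Greedy: '.*' grabs as much as possible, so the single match runs
# from the first '>' to the last '<'.
greedy = re.findall(">(.*)<", string)

# Non-greedy: '.*?' stops at the first '<' it can, giving one match per segment.
lazy = re.findall(">(.*?)<", string)

print(greedy)  # ['here is some text<>here is some more text']
print(lazy)    # ['here is some text', 'here is some more text']
```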

How can I use regular expressions to extract multiple citations separated by tags from text?
