Question

I want to use Python to find all TIMEX3 values in a string. For example, if my string is:

 Ecole Polytechnique, maar hij bleef daar slechts tot <TIMEX3 tid="t5" type="DATE" value="1888">1888</TIMEX3>. 
 Daarna had hij een korte carriere bij het leger als officier d'artillerie in <TIMEX3 tid="t6" type="DATE" value="1889">1889</TIMEX3>

I would like to get back the list

 ["1888", "1889"]

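So far I have tried converting the string to a tree with xml.etree.ElementTree, but it fails on my data with a parse error (not well-formed, invalid token). I thought I might be able to avoid this with a regular expression? Any help is much appreciated, thanks!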

Answer 1

You can use BeautifulSoup.

>>> from bs4 import BeautifulSoup
>>> s = '''Ecole Polytechnique, maar hij bleef daar slechts tot <TIMEX3 tid="t5" type="DATE" value="1888">1888</TIMEX3>. 
 Daarna had hij een korte carriere bij het leger als officier d'artillerie in <TIMEX3 tid="t6" type="DATE" value="1889">1889</TIMEX3>'''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly; it lowercases tag names
>>> [i.text for i in soup.find_all('timex3')]
['1888', '1889']
>>> [i['value'] for i in soup.find_all('timex3')]
['1888', '1889']
>>> [i['value'] for i in soup.find_all('timex3') if i.has_attr("value")]
['1888', '1889']
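
If installing BeautifulSoup is not an option, the same lookup can be sketched with the standard library's html.parser module, which is also lenient about markup that is not well-formed XML. This is only a minimal sketch: the class name Timex3Collector and the function timex3_values are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class Timex3Collector(HTMLParser):
    """Collect the value attribute of every TIMEX3 start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook.
        if tag == "timex3":
            value = dict(attrs).get("value")
            if value is not None:
                self.values.append(value)


def timex3_values(text):
    """Return the value attributes of all TIMEX3 tags in text, in document order."""
    parser = Timex3Collector()
    parser.feed(text)
    parser.close()
    return parser.values


sample = """Ecole Polytechnique, maar hij bleef daar slechts tot <TIMEX3 tid="t5" type="DATE" value="1888">1888</TIMEX3>.
 Daarna had hij een korte carriere bij het leger als officier d'artillerie in <TIMEX3 tid="t6" type="DATE" value="1889">1889</TIMEX3>"""

print(timex3_values(sample))  # ['1888', '1889']
```

Tags without a value attribute are skipped, which mirrors the has_attr("value") check above.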

Answer 2

If you want to use a regular expression, you can do the following:

>>> import re
>>> s = """
... Ecole Polytechnique, maar hij bleef daar slechts tot <TIMEX3 tid="t5" type="DATE" value="1888">1888</TIMEX3>. 
...  Daarna had hij een korte carriere bij het leger als officier d'artillerie in <TIMEX3 tid="t6" type="DATE" value="1889">1889</TIMEX3>"""
>>> result = re.findall(r'value="(\d+)"', s)
>>> result
['1888', '1889']
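
The pattern above picks up any value="..." attribute made of digits, wherever it appears. If the text may contain other tags with a value attribute, or TIMEX3 values that are not purely numeric (full dates such as 2012-03-04, for example), a slightly stricter pattern may be safer. The following is a sketch under those assumptions; the names TIMEX3_VALUE and timex3_values_re are illustrative only.

```python
import re

# Match an opening TIMEX3 tag and capture whatever sits inside value="...".
# Assumes double-quoted attributes and no '>' inside attribute values.
TIMEX3_VALUE = re.compile(r'<TIMEX3\b[^>]*?\bvalue="([^"]*)"', re.IGNORECASE)


def timex3_values_re(text):
    """Return the value attribute of every TIMEX3 opening tag found in text."""
    return TIMEX3_VALUE.findall(text)


sample = """Ecole Polytechnique, maar hij bleef daar slechts tot <TIMEX3 tid="t5" type="DATE" value="1888">1888</TIMEX3>.
 Daarna had hij een korte carriere bij het leger als officier d'artillerie in <TIMEX3 tid="t6" type="DATE" value="1889">1889</TIMEX3>"""

print(timex3_values_re(sample))  # ['1888', '1889']
```

This is still pattern matching rather than parsing, so unusual quoting or attribute values containing > will defeat it.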

That said, using BeautifulSoup, as in Avinash Raj's answer, works better.
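
As for the original ElementTree attempt: a fragment like the one in the question has text and several elements at the top level, so it is not a complete XML document on its own. Assuming the markup is otherwise well-formed, wrapping it in a synthetic root element is one way to make it parseable. This is a sketch, not a diagnosis of the exact error in the question; the function name timex3_values_et and the <root> wrapper are illustrative choices.

```python
import xml.etree.ElementTree as ET


def timex3_values_et(fragment):
    """Parse an XML fragment and return the value attribute of each TIMEX3 element."""
    # Wrap the fragment so that leading text and sibling elements share one root.
    root = ET.fromstring("<root>" + fragment + "</root>")
    return [el.get("value") for el in root.iter("TIMEX3") if el.get("value") is not None]


sample = """Ecole Polytechnique, maar hij bleef daar slechts tot <TIMEX3 tid="t5" type="DATE" value="1888">1888</TIMEX3>.
 Daarna had hij een korte carriere bij het leger als officier d'artillerie in <TIMEX3 tid="t6" type="DATE" value="1889">1889</TIMEX3>"""

print(timex3_values_et(sample))  # ['1888', '1889']
```

If the real data contains unescaped characters such as a bare &, ET.fromstring will still raise ParseError, in which case a lenient parser such as BeautifulSoup is the safer route.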

Get XML values from a string using Python
