Question

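My file contains scientist names in the following format: <scientist_names> <scientist>abc</scientist> </scientist_names> I want to use Python to strip the scientists' names out of the format above. How should I do it? I would like to use regular expressions, but I don't know how to use them... please help.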

Answer 1

This is XML, and you should use an XML parser such as lxml rather than regular expressions (because XML is not a regular language).

Here is an example:

from lxml import etree
text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print(scientist.text)
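
If lxml is not installed, roughly the same loop can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration: the second name (def) and the helper name extract_names are made up here and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Sample input modeled on the question; the second name is invented for illustration.
text = """<scientist_names> <scientist>abc</scientist> <scientist>def</scientist> </scientist_names>"""


def extract_names(xml_text):
    """Return the text of every <scientist> element, in document order."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter.
    return [element.text for element in root.iter("scientist")]


print(extract_names(text))
```

Returning a list instead of printing inside the loop makes it easier to reuse the names afterwards.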

Answer 2

Don't use regular expressions! (All the reasons are explained well [here])

Use an XML/HTML parser; take a look at BeautifulSoup.

Answer 3

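As noted above, this appears to be XML. In that case you should use an XML parser to parse this document; I recommend lxml (http://lxml.de).

Depending on your requirements, you may find SAX-style parsing more convenient than DOM-style parsing. SAX parsing only involves registering handlers that are called when the parser encounters particular tags. It works well as long as the meaning of a tag does not depend on context and you have more than one type of tag to process (which may not be the case here).

If your input document may be malformed, you may want to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML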
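
To make the SAX idea concrete, here is a minimal sketch using the standard library's xml.sax module rather than lxml. The class name ScientistHandler and its attributes are hypothetical names chosen for this example; only the startElement, characters, and endElement callbacks are part of the xml.sax ContentHandler interface.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text inside every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False
        self._chunks = []

    def startElement(self, name, attrs):
        # Start buffering text when a <scientist> tag opens.
        if name == "scientist":
            self._inside = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces.
        if self._inside:
            self._chunks.append(content)

    def endElement(self, name):
        # Store the buffered text when the </scientist> tag closes.
        if name == "scientist":
            self.names.append("".join(self._chunks))
            self._inside = False


text = "<scientist_names> <scientist>abc</scientist> </scientist_names>"
handler = ScientistHandler()
xml.sax.parseString(text.encode("utf-8"), handler)
print(handler.names)
```

For a document this small the handler is more code than a DOM loop; the approach tends to pay off on large files, since the whole tree never has to be held in memory.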

Answer 4

Here is a simple example that should handle the XML tags for you:

# Import the standard-library function for HTTP requests:
from urllib.request import urlopen

# Import minidom, an easy-to-use XML parser:
from xml.dom.minidom import parseString
# Both modules ship with the Python standard library.

# Download the file if it is not on the same machine; otherwise just read it from a path:
file = urlopen('http://www.somedomain.com/somexmlfile.xml')
# Read the response body (returned as bytes):
data = file.read()
# Close the connection because we don't need it anymore:
file.close()
# Parse the XML you downloaded:
dom = parseString(data)
# Retrieve the first XML tag (<tag>data</tag>) that the parser finds with name tagName,
# in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
# Strip off the tag (<tag>data</tag>  --->   data):
xmlData = xmlTag.replace('<scientist>', '').replace('</scientist>', '')
# Print the XML tag and data in this format: <tag>data</tag>
print(xmlTag)
# Just print the data:
print(xmlData)
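
The string replacement above only works when the <scientist> tag carries no attributes, and it only looks at the first match. A possible alternative, sketched here with an inline string instead of a download, reads the text nodes directly and covers every <scientist> element. The helper name element_text is made up for this example.

```python
from xml.dom.minidom import parseString

# Inline sample taken from the question, so nothing needs to be downloaded.
data = "<scientist_names> <scientist>abc</scientist> </scientist_names>"
dom = parseString(data)


def element_text(element):
    """Join the text nodes that are direct children of the element."""
    return "".join(
        node.data for node in element.childNodes if node.nodeType == node.TEXT_NODE
    )


# Collect the text of every <scientist> element, not just the first one.
names = [element_text(el) for el in dom.getElementsByTagName("scientist")]
print(names)
```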

Let me know if anything is unclear.

Stripping (XML?) tags from a document using Python
