I am parsing XML data that looks like this:
<title-group><article-title>Leucine to proline substitution by SNP at position 197 in Caspase-9 gene expression leads to neuroblastoma: a bioinformatics analysis</article-title></title-group>
Sometimes, though, the title contains italic tags, like this:
<title-group><article-title><italic>Interferon regulatory factor 5</italic> genetic variants are associated with cardiovascular disease in patients with rheumatoid arthritis</article-title></title-group>
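The following Python code returns the correctly concatenated title string, but only when the italic tag is not at the start of the title (as it is in the example above):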
#Get titles
for node in tree.iter('title-group'):
    for subnode in node.iter('article-title'):
        try:
            title = remove_control_characters(subnode.text)
            if len(title) == 0:
                for subsubnode in node.iter('italic'):
                    italic = subsubnode.text
                    tail = remove_control_characters(subsubnode.tail)
                    title += italic + tail
                    title = str(title)
                    break
        except:
            continue
        for subsubnode in node.iter('italic'):
            italic = subsubnode.text
            tail = remove_control_characters(subsubnode.tail)
            title += italic + tail
            title = str(title)
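When the italic tag is at the start of the string, nothing is returned.
Is there a simpler way to do this (without lxml)? Alternatively, suggested changes to the Python code would also be appreciated. Suggestions are welcome, and have a great day.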
EDIT [SOLVED]
#Get titles
for node in tree.iter('title-group'):
    for subnode in node.iter('article-title'):
        # itertext() yields the element's own text plus the text and tail of every descendant, in document order
        title = ''.join(subnode.itertext())
        print(remove_control_characters(title))
Answer 0 (score: 2)
Use the itertext() method on the <article-title> element and you should be fine.
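To illustrate why this works, here is a minimal sketch using only the standard library's xml.etree.ElementTree. When <italic> is the first child, the text attribute of <article-title> is None, because text only holds what comes before the first child element; the rest of the title sits in the tail of <italic>. itertext() walks all of these pieces in document order. Note that the body of remove_control_characters is not shown in the question, so the version below is a hypothetical stand-in, and the <root> wrapper and the get_titles name are assumptions made for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The two sample titles from the question, wrapped in an assumed <root> element.
XML = (
    "<root>"
    "<title-group><article-title>Leucine to proline substitution by SNP at "
    "position 197 in Caspase-9 gene expression leads to neuroblastoma: a "
    "bioinformatics analysis</article-title></title-group>"
    "<title-group><article-title><italic>Interferon regulatory factor 5</italic>"
    " genetic variants are associated with cardiovascular disease in patients "
    "with rheumatoid arthritis</article-title></title-group>"
    "</root>"
)


def remove_control_characters(s):
    # Hypothetical stand-in for the question's helper (its real body is not shown):
    # drops every character whose Unicode category starts with "C".
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


def get_titles(root):
    titles = []
    for node in root.iter("article-title"):
        # Join the element's text with the text and tail of every descendant.
        titles.append(remove_control_characters("".join(node.itertext())))
    return titles


root = ET.fromstring(XML)
first, second = root.iter("article-title")
titles = get_titles(root)

print(second.text)  # the text before <italic> is missing, so this is None
for title in titles:
    print(title)
```

With a helper like the stand-in above, passing None raises a TypeError, which the bare except in the question's loop would swallow silently. That is one plausible reason no title came back for the second example.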