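I would like to process an XML document sentence by sentence, even though the sentences are not marked up in it. My input looks like this: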
<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">
Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>
I would like my input to look more like this:
<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">
<s>Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s><s>Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s></p>
That way I could extract each sentence as a whole:
<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s>
<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s>
My test code is:
from lxml import etree

if __name__ == "__main__":
    xml1 = '''<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">
Recently, a first step in this direction has been taken
in the form of the framework called “dynamical fingerprints”,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>
'''
    print(xml1)
    root = etree.XML(xml1)
    sentences_info = []
    # Iterating over root visits its child elements (here the <sup> nodes),
    # not sentences, because the sentences are not marked up yet.
    for sentence in root:
        # I want to do more fun stuff here with the result
        sentence_text = sentence.text
        ref_ids = []
        for reference in sentence:
            if 'rid' in reference.attrib:
                ref_ids.append(reference.attrib['rid'])
        sent_par = {'reference_ids': ref_ids, 'text': sentence_text}
        sentences_info.append(sent_par)
        print(sent_par)
Answer 0 (score: 0)
Converting the BeautifulSoup object to a string and then cleaning it up with regular expressions worked well.
As far as I know, there is no built-in way to handle sentences in XML; it needs its own ad hoc solution.
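The string-and-regex approach mentioned in this answer can be sketched roughly as follows. This is only an illustration under several assumptions: it uses the standard library's `re` and `xml.etree.ElementTree` instead of BeautifulSoup, the names `SENTENCE_END`, `wrap_sentences`, and `extract_sentences` are made up for this example, the sample paragraph is an abbreviated version of the one in the question, and the boundary rule (end punctuation, optional `<sup>` citations, whitespace, then a capital letter) is a naive heuristic that will split wrongly on abbreviations such as "Fig. A".

```python
import re
import xml.etree.ElementTree as ET

NS = "https://jats.nlm.nih.gov/ns/archiving/1.0/"

# Heuristic boundary (an assumption, not a real sentence tokenizer):
# end punctuation, any trailing <sup>...</sup> citations, whitespace,
# and then an upper-case letter that starts the next sentence.
SENTENCE_END = re.compile(r'([.!?](?:<sup>.*?</sup>)*\s+)(?=[A-Z])', re.S)


def wrap_sentences(p_xml):
    """Return the <p> markup with each guessed sentence wrapped in <s>."""
    match = re.fullmatch(r'\s*(<p\b[^>]*>)(.*)(</p>)\s*', p_xml, re.S)
    if match is None:
        raise ValueError("expected a single <p> element")
    open_tag, body, close_tag = match.groups()
    # Close the current <s> and open a new one after every boundary.
    marked = SENTENCE_END.sub(r'\1</s><s>', body.strip())
    return f"{open_tag}<s>{marked}</s>{close_tag}"


def extract_sentences(wrapped_xml):
    """Collect the text and the cited reference ids of every <s> element."""
    root = ET.fromstring(wrapped_xml)
    result = []
    for s in root.findall(f"{{{NS}}}s"):
        ref_ids = [x.get("rid") for x in s.iter(f"{{{NS}}}xref") if x.get("rid")]
        result.append({"reference_ids": ref_ids, "text": "".join(s.itertext())})
    return result


# Abbreviated sample based on the paragraph in the question.
xml1 = '''<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">
Recently, a first step in this direction has been
taken.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>
'''

wrapped = wrap_sentences(xml1)
sentences = extract_sentences(wrapped)
for sentence in sentences:
    print(sentence)
```

Because the wrapping happens on the serialized string, the result has to be re-parsed before the `<s>` elements can be queried. For anything beyond well-behaved prose, a proper sentence tokenizer would be a safer way to find the boundaries than this regex.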
Answer 1 (score: 0)
This happens because the parsed XML still carries its namespace. Essentially, every element in the XML you parse looks like this:
<Element {https://jats.nlm.nih.gov/ns/archiving/1.0/}p at 0x108219048>
You can remove the namespace from the XML with the following function:
from lxml import etree

def remove_namespace(tree):
    for node in tree.iter():
        try:
            has_namespace = node.tag.startswith('{')
        except AttributeError:
            continue  # node.tag is not a string (node is a comment or similar)
        if has_namespace:
            # Keep only the local name that follows the '{namespace}' prefix
            node.tag = node.tag.split('}', 1)[1]
Then parse the XML and remove the namespace:
tree = etree.fromstring(xml1)
remove_namespace(tree) # remove namespace
tree.findall('sup') # output as [<Element sup at 0x1081d73c8>, <Element sup at 0x1081d7648>]
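As an alternative to stripping the namespace, the elements can also be queried with the namespace kept in place by passing a prefix map to `findall`. The sketch below uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; lxml's `findall` accepts a `namespaces` mapping in the same way. The prefix `j` and the shortened sample paragraph are arbitrary choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# "j" is an arbitrary prefix chosen for this example; only the URI matters.
NS = {"j": "https://jats.nlm.nih.gov/ns/archiving/1.0/"}

sample = (
    '<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">First.'
    '<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Second.'
    '<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>'
)
root = ET.fromstring(sample)

# An unqualified name does not match, because each tag is stored
# as '{namespace-uri}localname'.
unqualified = root.findall("sup")

# A prefixed name plus the prefix map matches the namespaced elements.
qualified = root.findall("j:sup", NS)

# The same map works for deeper paths, e.g. collecting all cited ids.
rids = [xref.get("rid") for xref in root.findall(".//j:xref", NS)]

print(len(unqualified), len(qualified), rids)
```

Keeping the namespace avoids modifying the tree, which matters if the document is later written back out as valid JATS.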