I have XML output that looks like this:
<lsar030><head></head><body><group1></group1><Part></part><national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>700</figure_number>
<item_number>34</item_number>
</national_stock_number_cross_reference>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>701</figure_number>
<item_number>10</item_number>
</national_stock_number_cross_reference>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>703</figure_number>
<item_number>9</item_number>
</national_stock_number_cross_reference></body></lsar030>
I wrote some basic code using xml.etree.cElementTree to organize the data and remove the duplicated <federal_supply_classification> and <national_item_identification_number> information.
for national_stock_number_cross_reference in body.findall('./' + junk + 'national_stock_number_cross_reference'):
    fsc = national_stock_number_cross_reference.find('./' + junk + 'federal_supply_classification').text
    niin = national_stock_number_cross_reference.find('./' + junk + 'national_item_identification_number').text
    nsn = fsc+niin
    if nsn in nsnList:
        pass
    else:
        nsnList.append(nsn)
nsnList.sort()
for nsn in nsnList:
    fsc = str(nsn[0:4])
    niin = str(nsn[4:])
    repeat = False
    for national_stock_number_cross_reference in body.findall('./' + junk + 'national_stock_number_cross_reference'):
        tabs = 3
        figure_number = (national_stock_number_cross_reference.find('./' + junk + 'figure_number').text)
        item_number = (int(national_stock_number_cross_reference.find('./' + junk + 'item_number').text))
        if national_stock_number_cross_reference.find('./' + junk + 'federal_supply_classification').text == fsc:
            if national_stock_number_cross_reference.find('./' + junk + 'national_item_identification_number').text == niin:
                nsnIndexXml += '<nsnindxrow>\n'
                if repeat == False:
                    nsnIndexXml += self.getNSNCode(fsc, niin, tabs)
                    repeat = True
                else:
                    nsnIndexXml += self.getNSNCode('', '', tabs)
                if figure_number[0].isnumeric:
                    figure_number = 'fig' + figure_number
                nsnIndexXml += '<callout assocfig="%s" label="%i">\n' % (figure_number, item_number)
                nsnIndexXml += '</nsnindxrow>\n'
nsnIndexXml += '</nsnindx>\n'
nsnIndexXml += '</nsnindxwp>'
My output ends up looking like this:
<nsnindxrow>
<nsn>
<fsc>5310</fsc>
<niin>00-880-5978</niin>
</nsn>
<callout assocfig="fig700" label="34">
</nsnindxrow>
<nsnindxrow>
<nsn>
<fsc></fsc>
<niin></niin>
</nsn>
<callout assocfig="fig701" label="10">
</nsnindxrow>
<nsnindxrow>
<nsn>
<fsc></fsc>
<niin></niin>
</nsn>
<callout assocfig="fig703" label="9">
</nsnindxrow>
when I need it to end up looking like this:
<nsnindxrow>
<nsn>
<fsc>5310</fsc>
<niin>00-880-5978</niin>
</nsn>
<callout assocfig="fig700" label="34">
<callout assocfig="fig701" label="10">
<callout assocfig="fig703" label="9">
</nsnindxrow>
Is there a simple way to extend the lookup and trim the code, or do I need to restructure my loops? If so, how?
Answer 0 (score: 1)
You are creating a new <nsnindxrow> tag each time you loop through a national_stock_number_cross_reference item (the 3rd for loop):
for national_stock_number_cross_reference in body.findall('./' + junk + 'national_stock_number_cross_reference'):
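The first time through the for loop, it creates the <nsn> tag because repeat is False. Then repeat switches to True, so on the next pass it emits an empty tag. It does this for each of the federal_supply_classification items you have, 3 in this case.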
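You need to move the <nsn> tag creation (together with the opening <nsnindxrow>) before this loop, and close the row after it, so that the tag is not recreated on every pass.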
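The restructuring described above might look like the following minimal sketch. It is not the asker's actual class: build_rows and format_nsn are hypothetical names, format_nsn stands in for self.getNSNCode (its NIIN formatting is inferred from the output shown in the question), and the junk namespace prefix is left out for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>700</figure_number>
<item_number>34</item_number>
</national_stock_number_cross_reference>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>701</figure_number>
<item_number>10</item_number>
</national_stock_number_cross_reference>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>703</figure_number>
<item_number>9</item_number>
</national_stock_number_cross_reference>
</body>"""


def format_nsn(fsc, niin):
    # Hypothetical stand-in for self.getNSNCode(); the 2-3-4 NIIN split
    # is an assumption based on the question's output.
    return '<nsn>\n<fsc>%s</fsc>\n<niin>%s-%s-%s</niin>\n</nsn>\n' % (
        fsc, niin[:2], niin[2:5], niin[5:])


def build_rows(body):
    # Pass 1: collect the callouts for each (fsc, niin) pair.
    groups = {}
    for ref in body.findall('./national_stock_number_cross_reference'):
        key = (ref.findtext('federal_supply_classification'),
               ref.findtext('national_item_identification_number'))
        callout = (ref.findtext('figure_number'),
                   int(ref.findtext('item_number')))
        groups.setdefault(key, []).append(callout)

    # Pass 2: open the row and write <nsn> once per pair, before the
    # callout loop, then close the row after the loop.
    out = ''
    for fsc, niin in sorted(groups):
        out += '<nsnindxrow>\n'
        out += format_nsn(fsc, niin)
        for figure_number, item_number in groups[(fsc, niin)]:
            if figure_number[0].isnumeric():
                figure_number = 'fig' + figure_number
            out += '<callout assocfig="%s" label="%i">\n' % (figure_number, item_number)
        out += '</nsnindxrow>\n'
    return out


rows = build_rows(ET.fromstring(SAMPLE))
print(rows)
```

Grouping into a dict first also replaces the nsnList bookkeeping, so the document is scanned once instead of once per NSN. The callout tags are left unclosed only to match the output requested in the question.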
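Answer 1 (score: 0)
Forgive me for proposing a different approach, but consider an XSLT solution, since your need is a classic case for the Muenchian Method: grouping on the <federal_supply_classification> and <national_item_identification_number> pair. For background, XSLT (part of the Extensible Stylesheet Language family, which includes XPath) is a special-purpose, declarative language designed to transform XML documents into various end-use structures.
Python's lxml module can run XSLT 1.0 scripts, or you can call an external processor such as Saxon or Xalan from the command line (for example from PowerShell or Bash). With this approach you avoid general-purpose loops, conditional logic, and even string concatenation of elements. Finally, an XSLT script is itself a well-formed XML file, so it can be parsed from a file or a string like any other XML document.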
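The example below wraps your posted XML sample in a root tag: <root>...</root>. Change root in the first template of the XSLT to your actual XML root (the element that directly contains the national_stock_number_cross_reference nodes).
XSLT script (save as .xsl; it is referenced below)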
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="numkey" match="national_stock_number_cross_reference"
use="concat(federal_supply_classification, national_item_identification_number)"/>
<xsl:template match="root">
<xsl:apply-templates select="national_stock_number_cross_reference[generate-id() =
generate-id(key('numkey', concat(federal_supply_classification, national_item_identification_number))[1])]"/>
</xsl:template>
<xsl:template match="national_stock_number_cross_reference">
<nsnindxrow>
<nsn>
<fsc><xsl:value-of select="federal_supply_classification"/></fsc>
<niin><xsl:value-of select="national_item_identification_number"/></niin>
</nsn>
<xsl:for-each select="key('numkey', concat(federal_supply_classification, national_item_identification_number))">
<callout>
<xsl:attribute name="assocfig"><xsl:value-of select="concat('fig', figure_number)"/></xsl:attribute>
<xsl:attribute name="label"><xsl:value-of select="item_number"/></xsl:attribute>
</callout>
</xsl:for-each>
</nsnindxrow>
</xsl:template>
</xsl:stylesheet>
Python script
import lxml.etree as ET
# LOAD XML AND XSL FILES
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM SOURCE
transform = ET.XSLT(xslt)
newdom = transform(dom)
print(newdom)
# OUTPUT TRANSFORMED TREE TO STRING
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
# OUTPUT STRING TO FILE
with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
Output
<?xml version='1.0' encoding='UTF-8'?>
<nsnindxrow>
<nsn>
<fsc>5310</fsc>
<niin>008805978</niin>
</nsn>
<callout assocfig="fig700" label="34"/>
<callout assocfig="fig701" label="10"/>
<callout assocfig="fig703" label="9"/>
</nsnindxrow>
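Since the stylesheet is itself XML, it can also be loaded from a string instead of a file. The following is a minimal sketch of that variant, assuming lxml is installed; it matches body (the parent element in the posted sample) rather than root, and the XML and XSL constant names are illustrative only.

```python
import lxml.etree as ET

XML = """<lsar030><head/><body>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>700</figure_number>
<item_number>34</item_number>
</national_stock_number_cross_reference>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>701</figure_number>
<item_number>10</item_number>
</national_stock_number_cross_reference>
<national_stock_number_cross_reference>
<federal_supply_classification>5310</federal_supply_classification>
<national_item_identification_number>008805978</national_item_identification_number>
<figure_number>703</figure_number>
<item_number>9</item_number>
</national_stock_number_cross_reference>
</body></lsar030>"""

XSL = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="numkey" match="national_stock_number_cross_reference"
 use="concat(federal_supply_classification, national_item_identification_number)"/>
<xsl:template match="body">
<xsl:apply-templates select="national_stock_number_cross_reference[generate-id() =
 generate-id(key('numkey', concat(federal_supply_classification, national_item_identification_number))[1])]"/>
</xsl:template>
<xsl:template match="national_stock_number_cross_reference">
<nsnindxrow>
<nsn>
<fsc><xsl:value-of select="federal_supply_classification"/></fsc>
<niin><xsl:value-of select="national_item_identification_number"/></niin>
</nsn>
<xsl:for-each select="key('numkey', concat(federal_supply_classification, national_item_identification_number))">
<callout>
<xsl:attribute name="assocfig"><xsl:value-of select="concat('fig', figure_number)"/></xsl:attribute>
<xsl:attribute name="label"><xsl:value-of select="item_number"/></xsl:attribute>
</callout>
</xsl:for-each>
</nsnindxrow>
</xsl:template>
</xsl:stylesheet>"""

# Parse both documents from strings; the stylesheet is ordinary XML.
dom = ET.fromstring(XML)
transform = ET.XSLT(ET.fromstring(XSL))

# Apply the stylesheet and print the serialized result tree.
result = transform(dom)
print(str(result))
```

The key and the generate-id() comparison are unchanged from the stylesheet above; only the first template's match pattern differs.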