I reused almost the same code from "merging xml files using python's ElementTree" and got it working. The XML files I am trying to merge look like this:
A.XML
<root>
<categories>
<category name="Biology" />
</categories>
<app>
<mainHeader><![CDATA[AP Biology]]></mainHeader>
<questions>
<question type="0" number="1" title="Biology #1">
<images />
<description><![CDATA[<b>Which of the following is
the site of protein synthesis?</b>]]></description>
<category><![CDATA[Biology]]></category>
<choices>
<choice name="A"><![CDATA[Cell wall]]></choice>
<choice name="B" correct_answer="true"><![CDATA[Ribosomes]]></choice>
<choice name="C"><![CDATA[Vacuoles]]></choice>
<choice name="D"><![CDATA[DNA polymerase]]></choice>
<choice name="E"><![CDATA[RNA polymerase]]></choice>
</choices>
<explanation><![CDATA[<b>Answer:</b> B, Ribosomes. Translation, the
process that converts mRNA code into protein, takes place in ribosomes.
<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes of RNA and
protein that are located in cell nuclei. Ribosomes catalyze both the
conversion of the mRNA code into amino acids as well as the assembly of
the individual amino acids into a peptide change that becomes a protein.
]]></explanation>
</question>
</questions>
</app>
</root>
B.XML
<root>
<categories>
<category name="Biology" />
</categories>
<app>
<mainHeader><![CDATA[SAT Biology]]></mainHeader>
<questions>
<question type="0" number="1" title="Biology #1">
<images>
</images>
<category><![CDATA[Biology]]></category>
<description><![CDATA[<b>The site of cellular respiration
is:</b>]]></description>
<choices>
<choice name="A"><![CDATA[DNA polymerase]]></choice>
<choice name="B"><![CDATA[Ribosomes]]></choice>
<choice name="C" correct_answer="true"><![CDATA[Mitochondria]]></choice>
<choice name="D"><![CDATA[RNA polymerase]]></choice>
<choice name="E"><![CDATA[Vacuoles]]></choice>
</choices>
<explanation><![CDATA[<b>Answer:</b> C, Mitochondria.
The mitochondrion (plural mitochondria) is known as the “powerhouse”
of the cell for its role in energy production.<br /><br />
<b>Key Takeaway: </b>The mitochondrion is a membrane-bound organelle
found in most eukaryotic cells. The dominant role of the mitochondrion
is the production of ATP through cellular respiration, which is
dependent on the presence of oxygen. All forms of cellular
respiration, glycolysis, Krebs’ cycle, and oxidative phosphorylation,
take place within the mitochondria.]]></explanation>
</question>
</questions>
</app>
</root>
This is the code I use to merge them:
import os, os.path, sys
import glob
from xml.etree import ElementTree

def run(files):
    # Collect every .xml file in the given directory.
    xml_files = glob.glob(files + "/*.xml")
    xml_element_tree = None
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        # print ElementTree.tostring(data)
        for question in data.iter('questions'):
            if xml_element_tree is None:
                # The first file becomes the base tree that the others are merged into.
                xml_element_tree = data
                insertion_point = xml_element_tree.find('app').findall("./questions")[0]
            else:
                # Append the <question> children of this <questions> element.
                insertion_point.extend(question)
    if xml_element_tree is not None:
        print ElementTree.tostring(xml_element_tree)
It works, except that the output does not keep the CDATA tags. Specifically, this is the output I get:
<root>
<categories>
<category name="Biology" />
</categories>
<app>
<mainHeader>AP Biology</mainHeader>
<questions>
<question number="1" title="Biology #1" type="0">
<images />
<category>Biology</category>
<description>&lt;b&gt;Which of the following is the site
of protein synthesis?&lt;/b&gt;</description>
<choices>
<choice name="A">Cell wall</choice>
<choice correct_answer="true" name="B">Ribosomes</choice>
<choice name="C">Vacuoles</choice>
<choice name="D">DNA polymerase</choice>
<choice name="E">RNA polymerase</choice>
</choices>
<explanation>&lt;b&gt;Answer:&lt;/b&gt; B, Ribosomes.
Translation, the process that converts mRNA code into protein,
takes place in ribosomes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;
Key Takeaway: &lt;/b&gt;Ribosomes are complexes of RNA and protein
that are located in cell nuclei. Ribosomes catalyze both the
conversion of the mRNA code into amino acids as well as the assembly
of the individual amino acids into a peptide change that becomes
a protein.</explanation>
</question>
<question number="1" title="Biology #1" type="0">
<images>
</images>
<category>Biology</category>
<description>&lt;b&gt;The site of cellular respiration is:&lt;/b&gt;
</description>
<choices>
<choice name="A">DNA polymerase</choice>
<choice name="B">Ribosomes</choice>
<choice correct_answer="true" name="C">Mitochondria</choice>
<choice name="D">RNA polymerase</choice>
<choice name="E">Vacuoles</choice>
</choices>
<explanation>&lt;b&gt;Answer:&lt;/b&gt; C, Mitochondria. The
mitochondrion (plural mitochondria) is known as the “
powerhouse” of the cell for its role in energy production.
&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Key Takeaway: &lt;/b&gt;The
mitochondrion is a membrane-bound organelle found in most
eukaryotic cells. The dominant role of the mitochondrion is the
production of ATP through cellular respiration, which is dependent
on the presence of oxygen. All forms of cellular respiration,
glycolysis, Krebs’ cycle, and oxidative phosphorylation,
take place within the mitochondria.</explanation>
</question>
</questions>
</app>
</root>
whereas the output I want is this:
<root>
<categories>
<category name="Biology" />
</categories>
<app>
<mainHeader><![CDATA[AP Biology]]></mainHeader>
<questions>
<question type="0" number="1" title="Biology #1">
<images />
<category><![CDATA[Biology]]></category>
<description><![CDATA[<b>Which of the following is the
site of protein synthesis?</b>]]></description>
<choices>
<choice name="A"><![CDATA[Cell wall]]></choice>
<choice name="B" correct_answer="true"><![CDATA[Ribosomes]]></choice>
<choice name="C"><![CDATA[Vacuoles]]></choice>
<choice name="D"><![CDATA[DNA polymerase]]></choice>
<choice name="E"><![CDATA[RNA polymerase]]></choice>
</choices>
<explanation><![CDATA[<b>Answer:</b> B, Ribosomes. Translation,
the process that converts mRNA code into protein, takes place in
ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes
of RNA and protein that are located in cell nuclei. Ribosomes
catalyze both the conversion of the mRNA code into amino acids as
well as the assembly of the individual amino acids into a peptide
change that becomes a protein.]]></explanation>
</question>
<question type="0" number="2" title="Biology #1">
<images />
<category><![CDATA[Biology]]></category>
<description><![CDATA[<b>The site of cellular respiration
is:</b>]]></description>
<choices>
<choice name="A"><![CDATA[DNA polymerase]]></choice>
<choice name="B"><![CDATA[Ribosomes]]></choice>
<choice name="C" correct_answer="true"><![CDATA[Mitochondria]]></choice>
<choice name="D"><![CDATA[RNA polymerase]]></choice>
<choice name="E"><![CDATA[Vacuoles]]></choice>
</choices>
<explanation><![CDATA[<b>Answer:</b> C, Mitochondria. The
mitochondrion (plural mitochondria) is known as the “powerhouse”
of the cell for its role in energy production.<br /><br />
<b>Key Takeaway: </b>The mitochondrion is a membrane-bound
organelle found in most eukaryotic cells. The dominant role
of the mitochondrion is the production of ATP through cellular
respiration, which is dependent on the presence of oxygen.
All forms of cellular respiration, glycolysis, Krebs’ cycle,
and oxidative phosphorylation, take place within the
mitochondria.]]></explanation>
</question>
</questions>
</app>
</root>
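How do I keep the CDATA tags in the merged output? How can I keep <b>, <br>, " and “ in my merged output instead of getting odd things like &lt;b&gt;? Sorry for the real noob question, but I would really appreciate the help.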
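Answer 0 (score: 1)
CDATA is meant specifically for data that the XML parser should not interpret as markup. I think the best you can do in this case is to capture the text: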
>>> from xml.etree import ElementTree as et
>>> element = et.fromstring('''<explanation><![CDATA[<b>Answer:</b> B, Ribosomes. Translation,
the process that converts mRNA code into protein, takes place in
ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes
of RNA and protein that are located in cell nuclei. Ribosomes
catalyze both the conversion of the mRNA code into amino acids as
well as the assembly of the individual amino acids into a peptide
change that becomes a protein.]]></explanation>''')
>>> element.text
'<b>Answer:</b> B, Ribosomes. Translation, \n the process that converts mRNA code into protein, takes place in \n ribosomes.<br /><br /><b>Key Takeaway: </b>Ribosomes are complexes \n of RNA and protein that are located in cell nuclei. Ribosomes \n catalyze both the conversion of the mRNA code into amino acids as \n well as the assembly of the individual amino acids into a peptide \n change that becomes a protein.'
Then you can unescape your text as @praveen suggested.
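A minimal sketch of that idea, assuming Python 3 and only the standard library. The one-element sample string is made up for illustration and is not taken from the files above. It shows that ElementTree returns the CDATA payload as ordinary text, escapes it again when serializing, and that `html.unescape` reverses the escaping on the resulting string:

```python
import html
from xml.etree import ElementTree as et

# Hypothetical one-element sample, not taken from A.XML or B.XML.
source = '<explanation><![CDATA[<b>Answer:</b> B, Ribosomes.]]></explanation>'

element = et.fromstring(source)
captured = element.text  # the CDATA payload as a plain string

# Serializing escapes the markup characters and drops the CDATA wrapper.
serialized = et.tostring(element, encoding="unicode")

# html.unescape turns the entity references back into literal characters.
restored = html.unescape(serialized)

print(captured)
print(serialized)
print(restored)
```

Note that the unescaped string is no longer the same XML document: the `<b>` tags become real child elements, so the result is only well-formed if the embedded HTML happens to be well-formed too.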
Answer 1 (score: 0)
Use Python's HTMLParser library, although this will not recreate the CDATA sections.
text = """
<root>
<categories>
<category name="Biology" />
</categories>
<app>
<mainHeader>AP Biology</mainHeader>
<questions>
<question number="1" title="Biology #1" type="0">
<images />
<category>Biology</category>
<description>&lt;b&gt;Which of the following is the site
of protein synthesis?&lt;/b&gt;</description>
<choices>
<choice name="A">Cell wall</choice>
<choice correct_answer="true" name="B">Ribosomes</choice>
<choice name="C">Vacuoles</choice>
<choice name="D">DNA polymerase</choice>
<choice name="E">RNA polymerase</choice>
</choices>
<explanation>&lt;b&gt;Answer:&lt;/b&gt; B, Ribosomes.
Translation, the process that converts mRNA code into protein,
takes place in ribosomes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;
Key Takeaway: &lt;/b&gt;Ribosomes are complexes of RNA and protein
that are located in cell nuclei. Ribosomes catalyze both the
conversion of the mRNA code into amino acids as well as the assembly
of the individual amino acids into a peptide change that becomes
a protein.</explanation>
</question>
<question number="1" title="Biology #1" type="0">
<images>
</images>
<category>Biology</category>
<description>&lt;b&gt;The site of cellular respiration is:&lt;/b&gt;
</description>
<choices>
<choice name="A">DNA polymerase</choice>
<choice name="B">Ribosomes</choice>
<choice correct_answer="true" name="C">Mitochondria</choice>
<choice name="D">RNA polymerase</choice>
<choice name="E">Vacuoles</choice>
</choices>
<explanation>&lt;b&gt;Answer:&lt;/b&gt; C, Mitochondria. The
mitochondrion (plural mitochondria) is known as the “
powerhouse” of the cell for its role in energy production.
&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Key Takeaway: &lt;/b&gt;The
mitochondrion is a membrane-bound organelle found in most
eukaryotic cells. The dominant role of the mitochondrion is the
production of ATP through cellular respiration, which is dependent
on the presence of oxygen. All forms of cellular respiration,
glycolysis, Krebs’ cycle, and oxidative phosphorylation,
take place within the mitochondria.</explanation>
</question>
</questions>
</app>
</root>
"""
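import HTMLParser  # Python 2 module; on Python 3 use html.unescape(text) instead
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(text)  # turn &lt; and &gt; back into < and >
print unescaped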
Output:
<root>
<categories>
<category name="Biology" />
</categories>
<app>
<mainHeader>AP Biology</mainHeader>
<questions>
<question number="1" title="Biology #1" type="0">
<images />
<category>Biology</category>
<description><b>Which of the following is the site
of protein synthesis?</b></description>
<choices>
<choice name="A">Cell wall</choice>
<choice correct_answer="true" name="B">Ribosomes</choice>
<choice name="C">Vacuoles</choice>
<choice name="D">DNA polymerase</choice>
<choice name="E">RNA polymerase</choice>
</choices>
<explanation><b>Answer:</b> B, Ribosomes.
Translation, the process that converts mRNA code into protein,
takes place in ribosomes.<br /><br /><b>
Key Takeaway: </b>Ribosomes are complexes of RNA and protein
that are located in cell nuclei. Ribosomes catalyze both the
conversion of the mRNA code into amino acids as well as the assembly
of the individual amino acids into a peptide change that becomes
a protein.</explanation>
</question>
<question number="1" title="Biology #1" type="0">
<images>
</images>
<category>Biology</category>
<description><b>The site of cellular respiration is:</b>
</description>
<choices>
<choice name="A">DNA polymerase</choice>
<choice name="B">Ribosomes</choice>
<choice correct_answer="true" name="C">Mitochondria</choice>
<choice name="D">RNA polymerase</choice>
<choice name="E">Vacuoles</choice>
</choices>
<explanation><b>Answer:</b> C, Mitochondria. The
mitochondrion (plural mitochondria) is known as the “
powerhouse” of the cell for its role in energy production.
<br /><br /><b>Key Takeaway: </b>The
mitochondrion is a membrane-bound organelle found in most
eukaryotic cells. The dominant role of the mitochondrion is the
production of ATP through cellular respiration, which is dependent
on the presence of oxygen. All forms of cellular respiration,
glycolysis, Krebs’ cycle, and oxidative phosphorylation,
take place within the mitochondria.</explanation>
</question>
</questions>
</app>
</root>
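If the CDATA sections themselves have to survive the merge, one option that neither answer mentions is `xml.dom.minidom` from the standard library, which keeps CDATA sections as separate nodes and writes them back out unchanged. The following is a minimal Python 3 sketch, not a drop-in replacement for the code in the question. It assumes each document has a single `<questions>` element, as A.XML and B.XML do. `merge_questions` is a hypothetical helper name, the two sample strings are shortened stand-ins for the real files, and the sketch does not renumber the `number` attributes.

```python
from xml.dom import minidom

def merge_questions(xml_strings):
    """Append the <question> elements of later documents to the first one."""
    merged = None
    target = None
    for xml_string in xml_strings:
        doc = minidom.parseString(xml_string)
        questions = doc.getElementsByTagName("questions")[0]
        if merged is None:
            # The first document becomes the base of the merged output.
            merged, target = doc, questions
            continue
        for question in questions.getElementsByTagName("question"):
            # importNode makes a deep copy owned by the base document,
            # including any CDATA section nodes below the element.
            target.appendChild(merged.importNode(question, True))
    return merged.documentElement.toxml()

# Shortened stand-ins for A.XML and B.XML (assumed for illustration).
a_xml = ('<root><app><questions><question number="1">'
         '<description><![CDATA[<b>Question A</b>]]></description>'
         '</question></questions></app></root>')
b_xml = ('<root><app><questions><question number="1">'
         '<description><![CDATA[<b>Question B</b>]]></description>'
         '</question></questions></app></root>')

merged_xml = merge_questions([a_xml, b_xml])
print(merged_xml)
```

To use this with files on disk, read each file's contents and pass the resulting strings (or bytes) to the helper in the order they should appear in the merged document.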