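I have a txt file with more than 100,000 lines, and I want to create an XML tree for each line. All of the lines should share the same root, though.
Here is the txt file: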
LIBRARY:
1,1,1,1,the
1,2,1,1,world
2,1,1,2,we
2,5,2,1,have
7,3,1,1,food
Desired output:
<LIBRARY>
  <BOOK ID="1">
    <CHAPTER ID="1">
      <SENT ID="1">
        <WORD ID="1">the</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
  <BOOK ID="1">
    <CHAPTER ID="2">
      <SENT ID="1">
        <WORD ID="1">world</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
  <BOOK ID="2">
    <CHAPTER ID="1">
      <SENT ID="1">
        <WORD ID="2">we</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
  <BOOK ID="2">
    <CHAPTER ID="5">
      <SENT ID="2">
        <WORD ID="1">have</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
  <BOOK ID="7">
    <CHAPTER ID="3">
      <SENT ID="1">
        <WORD ID="1">food</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
</LIBRARY>
I am using ElementTree to convert the txt file to an XML file. This is the code I run:
def expantree():
    lines = txtfile.readlines()
    for line in lines:
        split_line = line.split(',')
        BOOK.set('ID', split_line[0])
        CHAPTER.set('ID', split_line[1])
        SENTENCE.set('ID', split_line[2])
        WORD.set('ID', split_line[3])
        WORD.text = split_line[4]
        tree = ET.ElementTree(Root)
        tree.write(xmlfile)
The code runs, but I don't get the desired output. I get the following instead:
<LIBRARY>
  <BOOK ID="1">
    <CHAPTER ID="1">
      <SENT ID="1">
        <WORD ID="1">the</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
</LIBRARY>
<LIBRARY>
  <BOOK ID="1">
    <CHAPTER ID="2">
      <SENT ID="1">
        <WORD ID="1">world</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
</LIBRARY>
<LIBRARY>
  <BOOK ID="2">
    <CHAPTER ID="1">
      <SENT ID="1">
        <WORD ID="2">we</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
</LIBRARY>
<LIBRARY>
  <BOOK ID="2">
    <CHAPTER ID="5">
      <SENT ID="2">
        <WORD ID="1">have</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
</LIBRARY>
<LIBRARY>
  <BOOK ID="7">
    <CHAPTER ID="3">
      <SENT ID="1">
        <WORD ID="1">food</WORD>
      </SENT>
    </CHAPTER>
  </BOOK>
</LIBRARY>
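How can I unify the tree root, so that I get a single root tag instead of multiple root tags?
Answer 0 (score: 1)
Another, possibly more concise, option is the following: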
from xml.etree import ElementTree as ET
import io
import os

# Set up the test input
inbuf = io.StringIO(''.join(['LIBRARY:\n', '1,1,1,1,the\n', '1,2,1,1,world\n',
                             '2,1,1,2,we\n', '2,5,2,1,have\n', '7,3,1,1,food\n']))

tags = ['BOOK', 'CHAPTER', 'SENT', 'WORD']

with inbuf as into, io.StringIO() as xmlfile:
    # The first line names the root element, e.g. "LIBRARY:"
    root_name = into.readline()
    root = ET.ElementTree(ET.Element(root_name.rstrip(':\n')))
    root_elem = root.getroot()
    for line in into:
        values = line.split(',')
        parent = root_elem
        # Nest BOOK > CHAPTER > SENT > WORD under the shared root
        for i, v in enumerate(values[:4]):
            parent = ET.SubElement(parent, tags[i], {'ID': v})
            if i == 3:
                parent.text = values[4].rstrip(':\n')
    # Write the tree once, after every line has been added
    root.write(xmlfile, encoding='unicode', xml_declaration=True)
    xmlfile.seek(0, os.SEEK_SET)
    for line in xmlfile:
        print(line)
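This code builds an ElementTree from the input data and writes it as an XML file to a file-like object. It can be used with either the standard Python xml.etree package or lxml. The code was tested with Python 3.3.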
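As a minimal sketch of the same idea wrapped in a reusable function, the following uses only the standard-library xml.etree.ElementTree. The names `build_library` and `TAGS` are illustrative and not part of the answer above, and the sketch assumes that the first line names the root and that every data line has exactly five comma-separated fields.

```python
import io
from xml.etree import ElementTree as ET

TAGS = ['BOOK', 'CHAPTER', 'SENT', 'WORD']  # nesting order, outermost first


def build_library(lines):
    """Build one root element holding a BOOK subtree per data line.

    Hypothetical helper: assumes the first line names the root
    (e.g. "LIBRARY:") and each later line is "book,chapter,sent,word,text".
    """
    lines = iter(lines)
    root = ET.Element(next(lines).strip().rstrip(':'))
    for line in lines:
        values = line.strip().split(',', 4)
        if len(values) != 5:
            continue  # skip blank or malformed lines
        parent = root
        for tag, value in zip(TAGS, values[:4]):
            parent = ET.SubElement(parent, tag, {'ID': value})
        parent.text = values[4]  # the innermost element (WORD) gets the text
    return root


sample = io.StringIO('LIBRARY:\n1,1,1,1,the\n1,2,1,1,world\n2,1,1,2,we\n')
library = build_library(sample)
print(ET.tostring(library, encoding='unicode'))
```

Because the root is created once, before the loop, and every line only appends to it, the serialized result has a single LIBRARY element regardless of how many lines are read.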
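Answer 1 (score: 1)
Here is a suggestion that uses lxml (tested with Python 2.7). The code can easily be adapted to work with ElementTree as well, but getting nicely pretty-printed output is harder in that case (for more on this, see https://stackoverflow.com/a/16377996/407651).
The input file is library.txt and the output file is library.xml.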
from lxml import etree

lines = open("library.txt").readlines()
library = etree.Element('LIBRARY')  # The root element

# For each line with data in the input file, create a BOOK/CHAPTER/SENT/WORD structure
for line in lines:
    values = line.split(',')
    if len(values) == 5:
        book = etree.SubElement(library, "BOOK")
        book.set("ID", values[0])
        chapter = etree.SubElement(book, "CHAPTER")
        chapter.set("ID", values[1])
        sent = etree.SubElement(chapter, "SENT")
        sent.set("ID", values[2])
        word = etree.SubElement(sent, "WORD")
        word.set("ID", values[3])
        word.text = values[4].strip()

etree.ElementTree(library).write("library.xml", pretty_print=True)
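For readers on a newer interpreter than the one this answer was tested with, pretty-printing with the standard library has become easier: `xml.etree.ElementTree.indent` was added in Python 3.9. The sketch below is an illustration under that assumption, using a small hand-built tree in place of the one read from library.txt.

```python
from xml.etree import ElementTree as ET

# Small hand-built tree standing in for the LIBRARY element built from the file
library = ET.Element('LIBRARY')
book = ET.SubElement(library, 'BOOK', {'ID': '1'})
chapter = ET.SubElement(book, 'CHAPTER', {'ID': '1'})
sent = ET.SubElement(chapter, 'SENT', {'ID': '1'})
word = ET.SubElement(sent, 'WORD', {'ID': '1'})
word.text = 'the'

# Adds indentation whitespace in place; requires Python 3.9 or later
ET.indent(library, space='  ')
pretty = ET.tostring(library, encoding='unicode')
print(pretty)
```

On older Python versions this call is not available, which is where lxml's `pretty_print=True` or the approach in the linked answer comes in.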
Answer 2 (score: 0)
One approach is to build the complete tree and then write it out. I used the following code:
from lxml import etree as ET

def create_library(lines):
    library = ET.Element('LIBRARY')
    for line in lines:
        split_line = line.strip().split(',')
        if len(split_line) == 5:  # skip the "LIBRARY:" header and blank lines
            library.append(create_book(split_line))
    return library

def create_book(split_line):
    book = ET.Element('BOOK', ID=split_line[0])
    book.append(create_chapter(split_line))
    return book

def create_chapter(split_line):
    chapter = ET.Element('CHAPTER', ID=split_line[1])
    chapter.append(create_sentence(split_line))
    return chapter

def create_sentence(split_line):
    sentence = ET.Element('SENT', ID=split_line[2])
    sentence.append(create_word(split_line))
    return sentence

def create_word(split_line):
    word = ET.Element('WORD', ID=split_line[3])
    word.text = split_line[4]
    return word
The code that creates your file then looks like this:
def expantree():
    lines = txtfile.readlines()
    library = create_library(lines)
    ET.ElementTree(library).write(xmlfile)
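If you don't want to hold the whole tree in memory (you mentioned having more than 100,000 lines), you can write the root tags by hand: write the opening tag, write one book at a time, and then write the closing tag. In that case your code would look like this: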
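def expantree():
    # Write the root tags by hand and serialize one BOOK at a time
    with open(xmlfile, 'wb') as f:
        f.write(b'<LIBRARY>')
        for line in txtfile:  # iterate lazily instead of calling readlines()
            split_line = line.strip().split(',')
            if len(split_line) != 5:  # skip the "LIBRARY:" header and blank lines
                continue
            book = create_book(split_line)
            f.write(ET.tostring(book))
        f.write(b'</LIBRARY>')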
I don't have much experience with lxml, so there may be more elegant solutions, but both of these approaches work.
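The streaming variant can also be sketched with the standard library alone. The helper names `make_book` and `stream_library` below are hypothetical, and in-memory buffers stand in for the real input and output files so that the example is self-contained. It assumes the same five-field line format as the question.

```python
import io
from xml.etree import ElementTree as ET


def make_book(values):
    """Build one BOOK/CHAPTER/SENT/WORD subtree from five field values."""
    book = ET.Element('BOOK', {'ID': values[0]})
    chapter = ET.SubElement(book, 'CHAPTER', {'ID': values[1]})
    sent = ET.SubElement(chapter, 'SENT', {'ID': values[2]})
    word = ET.SubElement(sent, 'WORD', {'ID': values[3]})
    word.text = values[4]
    return book


def stream_library(src, dst):
    """Read text lines from src and write XML bytes to dst, one BOOK at a time."""
    dst.write(b'<LIBRARY>')
    for line in src:  # file objects yield lines lazily
        values = line.strip().split(',', 4)
        if len(values) != 5:
            continue  # skips the "LIBRARY:" header and blank lines
        dst.write(ET.tostring(make_book(values)))  # tostring returns bytes by default
    dst.write(b'</LIBRARY>')


src = io.StringIO('LIBRARY:\n1,1,1,1,the\n7,3,1,1,food\n')
dst = io.BytesIO()
stream_library(src, dst)
print(dst.getvalue().decode('ascii'))
```

Only one BOOK subtree exists in memory at any time, so memory use stays roughly constant as the input grows. The trade-off is that the root tags are written as raw bytes, so the output is only well-formed if the closing tag is always written.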