Question

This is my txt file:

In File Name:   C:\Users\naqushab\desktop\files\File 1.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size:   Low:    22636   High:   0
Total Process time: 1.859000
Out File Size:  Low:    77619   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 2.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size:   Low:    20673   High:   0
Total Process time: 3.094000
Out File Size:  Low:    94485   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 3.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 3.m2
In File Size:   Low:    66859   High:   0
Total Process time: 3.516000
Out File Size:  Low:    217268  High:   0

I am trying to parse it into XML in this format:

<?xml version='1.0' encoding='utf-8'?>
<root>
    <filedata>
        <InFileName>File 1.m1</InFileName>
        <OutFileName>File 1.m2</OutFileName>
        <InFileSize>22636</InFileSize>
        <OutFileSize>77619</OutFileSize>
        <ProcessTime>1.859000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 2.m1</InFileName>
        <OutFileName>File 2.m2</OutFileName>
        <InFileSize>20673</InFileSize>
        <OutFileSize>94485</OutFileSize>
        <ProcessTime>3.094000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 3.m1</InFileName>
        <OutFileName>File 3.m2</OutFileName>
        <InFileSize>66859</InFileSize>
        <OutFileSize>217268</OutFileSize>
        <ProcessTime>3.516000</ProcessTime>
    </filedata>
</root>

Here is the code I am trying to implement (I am using Python 2):

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>In File Name:
                       |Out File Name:
                       |In File Size:   Low:
                       |Total Process time:
                       |Out File Size:  Low:
                     )
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)

But I get empty values. Is it possible to parse this txt into XML?

Answer 1

A correction to the regular expression. It should be

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)

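instead of the one you gave. Note that | has the lowest precedence in a regular expression, so In File Name|Out File Name already matches either whole phrase; the alternation itself is not the problem. Your pattern fails because it is compiled with re.VERBOSE, which discards the unescaped spaces, so it searches for InFileName:, OutFileName: and so on, and none of those occur in the file.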
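To see what the single-line pattern above captures, here is a minimal sketch that runs it against a few lines copied from the question's file. It is written for Python 3 (the question uses Python 2), the names sample_lines and results are illustrative only, and the columns are assumed to be separated by spaces.

```python
import re

# The single-line pattern from the answer above. It is not compiled with
# re.VERBOSE, so the literal spaces inside it are significant.
pattern = re.compile(
    r'(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
    r'|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)'
)

# A few lines copied from the question's input file.
sample_lines = [
    r'In File Name:   C:\Users\naqushab\desktop\files\File 1.m1',
    'In File Size:   Low:    22636   High:   0',
    'Total Process time: 1.859000',
]

results = []
for line in sample_lines:
    m = pattern.search(line)
    if m:
        # strip() removes the padding between the colon and the value.
        results.append((m.group('title'), m.group('value').strip()))

for title, value in results:
    print(title, '->', value)
```

The value captured for the size line still contains the trailing High: field, and the file name is still a full path, so both need further trimming to reach the XML shown in the question.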

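A suggestion:

You can do this without regular expressions. xml.dom.minidom can be used to pretty-print your XML string.

I have added inline comments for better understanding!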

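Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.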
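As a small illustration of the indent and newl arguments described above, the sketch below builds a tiny tree with ElementTree and passes its serialized form through minidom. The element names are taken from the question; the variable names compact and pretty are made up for this example.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Build a one-record tree using tag names from the question.
root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'

# ET.tostring gives the compact serialization; minidom re-parses it and
# emits an indented version, one element per line.
compact = ET.tostring(root)
pretty = minidom.parseString(compact).toprettyxml(indent='    ', newl='\n')
print(pretty)
```

Passing an encoding argument as well makes toprettyxml return an encoded byte string instead of text, which matters when writing the result to a file.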

Edit

import itertools as it
[line[0] for line in it.groupby(lines)]

You can use groupby from the itertools package to collapse runs of consecutive duplicate entries in the list lines.
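A quick sketch of that behaviour, using a made-up list in place of the file's lines:

```python
import itertools as it

# Hypothetical input: repeated blank lines and one repeated entry.
lines = ['a', '', '', 'b', 'b', '', 'a']

# groupby yields one (key, group) pair per run of equal adjacent items,
# so taking only the key keeps a single copy of each run.
deduped = [key for key, _group in it.groupby(lines)]
print(deduped)  # ['a', '', 'b', '', 'a']
```

Only adjacent duplicates are merged: the final 'a' survives because it is not next to the first one. For this task that means several blank lines in a row count as a single record separator.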

So,

import xml.etree.ElementTree as ET
root = ET.Element('root')
with open('file1.txt') as f:
    lines = f.read().splitlines()
#add first subelement
celldata = ET.SubElement(root, 'filedata')
import itertools as it
#for every line in input file
#group consecutive dedup to one
for line in it.groupby(lines):
    line = line[0]
    #if its a break of subelements - that is an empty space
    if not line:
        #add the next subelement and get it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        #otherwise, split with : to get the tag name
        tag = line.split(":")
        #format tag name
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()
        #get file name from file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            splist = filter(None, line.split(" "))
            tag = splist[splist.index('Low:')+1]
            #splist[splist.index('High:')+1]
        el.text = tag
#prettify xml
import xml.dom.minidom as minidom
formatedXML = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()
# Display for debugging
print formatedXML
#write the formatedXML to file.
with open("Performance.xml", "w+") as f:
    f.write(formatedXML)

Output: Performance.xml

<?xml version="1.0" encoding="utf-8"?>
<root>
 <filedata>
  <InFileName>File 1.m1</InFileName>
  <OutFileName>File 1.m2</OutFileName>
  <InFileSize>22636</InFileSize>
  <TotalProcesstime>1.859000</TotalProcesstime>
  <OutFileSize>77619</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 2.m1</InFileName>
  <OutFileName>File 2.m2</OutFileName>
  <InFileSize>20673</InFileSize>
  <TotalProcesstime>3.094000</TotalProcesstime>
  <OutFileSize>94485</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 3.m1</InFileName>
  <OutFileName>File 3.m2</OutFileName>
  <InFileSize>66859</InFileSize>
  <TotalProcesstime>3.516000</TotalProcesstime>
  <OutFileSize>217268</OutFileSize>
 </filedata>
</root>
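The script above targets Python 2 (print statement, filter returning a list). For readers on Python 3, the same approach might be sketched as follows. This is an adaptation under assumptions, not the answer's original code: build_tree and sample are hypothetical names, the sample text is copied from the first two records in the question, and the columns are assumed to be separated by spaces.

```python
import itertools
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET


def build_tree(text):
    """Build <root>/<filedata> elements from the report text (hypothetical helper)."""
    root = ET.Element('root')
    filedata = ET.SubElement(root, 'filedata')
    # Collapse runs of identical adjacent lines, e.g. repeated blank lines.
    for line, _group in itertools.groupby(text.splitlines()):
        if not line.strip():
            # A blank line separates two records.
            filedata = ET.SubElement(root, 'filedata')
            continue
        # The text before the first colon becomes the tag name.
        label, _, rest = line.partition(':')
        element = ET.SubElement(filedata, label.replace(' ', ''))
        if 'File Name' in label:
            # Keep only the last component of the Windows path.
            element.text = line.split('\\')[-1].strip()
        elif 'File Size' in label:
            # split() breaks on any whitespace run; take the token after 'Low:'.
            tokens = line.split()
            element.text = tokens[tokens.index('Low:') + 1]
        else:
            element.text = rest.strip()
    return root


sample = '\n'.join([
    r'In File Name:   C:\Users\naqushab\desktop\files\File 1.m1',
    r'Out File Name:  C:\Users\naqushab\desktop\files\Output\File 1.m2',
    'In File Size:   Low:    22636   High:   0',
    'Total Process time: 1.859000',
    'Out File Size:  Low:    77619   High:   0',
    '',
    r'In File Name:   C:\Users\naqushab\desktop\files\File 2.m1',
    r'Out File Name:  C:\Users\naqushab\desktop\files\Output\File 2.m2',
    'In File Size:   Low:    20673   High:   0',
    'Total Process time: 3.094000',
    'Out File Size:  Low:    94485   High:   0',
])

root = build_tree(sample)
pretty = minidom.parseString(ET.tostring(root)).toprettyxml(indent=' ')
print(pretty)
```

As in the original script, a blank line at the very start or end of the input would create an empty filedata element, so the input may need trimming first.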

Hope it helps!

Answer 2

From the documentation (emphasis mine):

re.VERBOSE
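This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.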

Escape the spaces in your regular expression, or use the \s class.
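A minimal sketch of that fix, assuming the question's pattern is kept in verbose form: every space that must be matched is written as an escaped space or as \s. The name broken (a shortened copy of the question's pattern) and the sample lines are introduced here only for comparison.

```python
import re

# Under re.VERBOSE, unescaped whitespace in the pattern is dropped, so each
# space that must match is written as "\ " or as the \s class.
rex = re.compile(r'''
    (?P<title>In\ File\ Name
      |Out\ File\ Name
      |In\ File\ Size:\s*Low
      |Total\ Process\ time
      |Out\ File\ Size:\s*Low
    )
    :\s*                # colon after the title, then optional padding
    (?P<value>.*)
    ''', re.VERBOSE)

# Shortened copy of the question's pattern: its spaces are silently removed,
# so it effectively searches for "InFileName:" or "OutFileName:".
broken = re.compile(r'''(?P<title>In File Name:
                          |Out File Name:
                        )
                        (?P<value>.*)
                        ''', re.VERBOSE)

name_line = r'In File Name:   C:\Users\naqushab\desktop\files\File 1.m1'
size_line = 'In File Size:   Low:    22636   High:   0'

print(broken.search(name_line))  # None

m_name = rex.search(name_line)
m_size = rex.search(size_line)
print(m_name.group('title'), '|', m_name.group('value'))
print(m_size.group('title'), '|', m_size.group('value'))
```

The escaped version matches where the original does not, which is why the question's code produced filedata elements with no children.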

How do I parse a .txt file into .xml?

2 answers: