Generator 'yield' when splitting a structured input file produces out-of-sync results

Time: 2017-03-17 19:23:44

Tags: python xml parsing generator

See @georg's excellent answer (which I have adapted below): Split one file into multiple files based on pattern (cut can occur within lines)

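I found this to be a potentially useful pattern for splitting a file into several files based on an initial delimiter. However, as a commenter there points out, it creates a blank file first, and the reason was not clear. I think that is related to the problem I am having.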

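In my adaptation (clumsy, I'm no Python master!), I try to set the file name by parsing the line that follows the delimiter, and then open a new output file by calling outfile = next(fs) on the generator.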

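The dilemma, of course, is that the domain name is not known until the line after the delimiter has been read. I end up with file names that are out of sync with the data they contain.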

The input file contains more than 100 XML 'trees', each starting with the standard declaration
<?xml version='1.0' encoding='UTF-8'?>

followed by a line that contains the domain name:

<ns2:domain ... name="atypi.org" ...">

Here is my current script:

#!/usr/bin/python2.7

import re

def files():
    n = 0 
    while n<12 :
         n += 1
         print "**DEBUG** in generator nameFile=%s n=%d \r" % (nameFile, n) 
         yield open('/Users/peterf/Google Drive/2015 Projects-Strategy/Domain Admin/RackDomains/%s.part.xml' % nameFile, 'w')


filename='/Users/peterf/Google Drive/2015 Projects-Strategy/Domain Admin/RackspaceListDomain.output.xml'
nameFile=''
pat ='<?xml'
namePat=re.compile('<ns2:domain.+ name="(.+?)".+>')
fs = files()
outfile = next(fs)

with open(filename) as infile:
     for line in infile:   
        m=namePat.search(line)
        if m:
           nameFile=m.group(1)
           print "<---\rin 'if m:' nameFile=%s\r" % (nameFile)   
        if pat not in line: 
#           print "\rin 'pat not in line' line=%s\r" % (line)       
           outfile.write(line)
        else:
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                print "in 'for item' pre next(fs) nameFile=%s\r" % (nameFile)
                outfile = next(fs)
                print "in 'for item' post next(fs) nameFile=%s --->\r" % (nameFile)
                outfile.write(pat + item)

My debug listing shows:

**DEBUG** in generator nameFile= n=1 

in 'for item' pre next(fs) nameFile=

**DEBUG** in generator nameFile= n=2 

in 'for item' post next(fs) nameFile= --->

<---
in 'if m:' nameFile=addressing.com

in 'for item' pre next(fs) nameFile=addressing.com

**DEBUG** in generator nameFile=addressing.com n=3 

in 'for item' post next(fs) nameFile=addressing.com --->

<---
in 'if m:' nameFile=alicemcmahon.com

in 'for item' pre next(fs) nameFile=alicemcmahon.com

**DEBUG** in generator nameFile=alicemcmahon.com n=4 

in 'for item' post next(fs) nameFile=alicemcmahon.com --->

<---
in 'if m:' nameFile=alphabets.com

in 'for item' pre next(fs) nameFile=alphabets.com

**DEBUG** in generator nameFile=alphabets.com n=5 

in 'for item' post next(fs) nameFile=alphabets.com --->

The output directory contains these file names, starting with a truncated name that I guess comes from the first 'yield':

.part.xml (this has data from 'addressing.com')
addressing.com.part.xml
alicemcmahon.com.part.xml
alphabets.com.part.xml
americanletterpress.com.part.xml
americanwoodtype.com.part.xml
amyshoemaker.com.part.xml
archaicrevivalbooks.com.part.xml
archaicrevivalfonts.com.part.xml
archaicrevivalimages.com.part.xml
astroteddies.com.part.xml

I can't figure out how to fix this: the generator creates the output file before I have the proper name for it.

Here are some representative parts of the input file:

<?xml version='1.0' encoding='utf-8'?>
<ns2:domain xmlns:ns3="http://www.w3.org/2005/Atom" xmlns:ns2="http://docs.rackspacecloud.com/dns/api/v1.0" xmlns="http://docs.rackspacecloud.com/dns/api/management/v1.0" id="1204245"  name="addressing.com" ttl="300" emailAddress="ipadmin@stabletransit.com" updated="2012-10-10T21:33:36Z" created="2009-07-25T15:05:39Z">
    <ns2:nameservers>
        <ns2:nameserver name="dns1.stabletransit.com" />
        <ns2:nameserver name="dns2.stabletransit.com" />
    </ns2:nameservers>
    <ns2:recordsList totalEntries="5">
        <ns2:record id="A-2542579" type="A" name="addressing.com" data="198.101.155.141" ttl="300" updated="2012-10-10T21:33:35Z" created="2010-02-17T05:02:16Z" />
    </ns2:recordsList>
</ns2:domain>
<?xml version='1.0' encoding='UTF-8'?>
<ns2:domain xmlns:ns3="http://www.w3.org/2005/Atom" xmlns:ns2="http://docs.rackspacecloud.com/dns/api/v1.0" xmlns="http://docs.rackspacecloud.com/dns/api/management/v1.0" id="2776403"  name="alicemcmahon.com" ttl="300" emailAddress="ipadmin@stabletransit.com" updated="2013-10-21T16:43:17Z" created="2011-05-01T03:01:51Z">
    <ns2:nameservers>
        <ns2:nameserver name="dns1.stabletransit.com" />
        <ns2:nameserver name="dns2.stabletransit.com" />
    </ns2:nameservers>
    <ns2:recordsList totalEntries="10">
        <ns2:record id="A-6895108" type="A" name="alicemcmahon.com" data="216.185.152.144" ttl="300" updated="2013-10-21T16:43:17Z" created="2011-05-01T03:01:51Z" />
    </ns2:recordsList>
</ns2:domain>
<?xml version='1.0' encoding='UTF-8'?>
<ns2:domain xmlns:ns3="http://www.w3.org/2005/Atom" xmlns:ns2="http://docs.rackspacecloud.com/dns/api/v1.0" xmlns="http://docs.rackspacecloud.com/dns/api/management/v1.0" id="1204247"  name="americanletterpress.com" ttl="300" emailAddress="ipadmin@stabletransit.com" updated="2012-10-10T21:33:37Z" created="2009-07-25T15:05:41Z">
    <ns2:nameservers>
        <ns2:nameserver name="dns1.stabletransit.com" />
        <ns2:nameserver name="dns2.stabletransit.com" />
    </ns2:nameservers>
    <ns2:recordsList totalEntries="5">
        <ns2:record id="A-2542581" type="A" name="americanletterpress.com" data="198.101.155.141" ttl="300" updated="2012-10-10T21:33:36Z" created="2010-02-17T05:02:16Z" />        
    </ns2:recordsList>
</ns2:domain>
<?xml version='1.0' encoding='UTF-8'?>
<ns2:domain xmlns:ns3="http://www.w3.org/2005/Atom" xmlns:ns2="http://docs.rackspacecloud.com/dns/api/v1.0" xmlns="http://docs.rackspacecloud.com/dns/api/management/v1.0" id="1204249"  name="americanwoodtype.com" ttl="300" emailAddress="ipadmin@stabletransit.com" updated="2012-10-10T21:33:38Z" created="2009-07-25T15:05:42Z">
    <ns2:nameservers>
        <ns2:nameserver name="dns1.stabletransit.com" />
        <ns2:nameserver name="dns2.stabletransit.com" />
    </ns2:nameservers>
    <ns2:recordsList totalEntries="5">
        <ns2:record id="A-2542583" type="A" name="americanwoodtype.com" data="198.101.155.141" ttl="300" updated="2012-10-10T21:33:37Z" created="2010-02-17T05:02:16Z" />
    </ns2:recordsList>
</ns2:domain>

1 answer:

Answer 0 (score: 1)

You ask the generator to produce an output file right at the start, while nameFile is still empty:

nameFile=''
# ...
outfile = next(fs)

That is your blank file name. Postpone calling next(fs) until you have a value for nameFile, not before.
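To see why the timing matters, here is a minimal sketch of how a generator reads a global variable. It assumes nothing about the real script beyond the pattern it uses; `current_name` and `names()` are hypothetical stand-ins for `nameFile` and `files()`, and no files are opened. The generator body only runs when `next()` is called, and it sees whatever value the global holds at that moment.

```python
current_name = ''  # hypothetical stand-in for the question's global nameFile


def names():
    # Calling names() runs none of this body. Each next() resumes it,
    # and current_name is looked up at that moment, not earlier.
    while True:
        yield '%s.part.xml' % current_name


gen = names()
first = next(gen)                 # current_name is still '' here
current_name = 'addressing.com'
second = next(gen)                # the updated value is picked up here

print(repr(first), repr(second))
```

The first name is built from the empty string, which matches the stray `.part.xml` file in the output directory. Every later `next(fs)` in the question happens when a new `<?xml` line is seen, that is, before the following line has updated the name, so each file is named after the previous document.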

You can set outfile = None and test for None before writing:

if pat not in line:
    if outfile is not None: 
        outfile.write(line)
else:
    items = line.split(pat)
    if outfile is not None:
        outfile.write(items[0])

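If you need to keep the lines that come before the first file name is found, store them in a buffer, then write out and clear the buffer when the new file is first created.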

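Not that I think you should be using a generator here; it makes things more complicated than they need to be. Creating new file objects directly in the loop would be much clearer.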

If all you want to do is split the file, use a buffer until you have a file name:

# filename and namePat are the ones defined in the question's script
buffer = []
out_name = '/Users/peterf/Google Drive/2015 Projects-Strategy/Domain Admin/RackDomains/%s.part.xml'

outfile = None

with open(filename) as infile:
    for line in infile:
        # look for a filename to write to if we don't have one yet
        if outfile is None:
            match = namePat.search(line)
            if match:
                # New filename, open a file object
                outfile = open(out_name % match.group(1), 'w')
                # clear out the buffer, we'll write directly to 
                # the file after this.
                outfile.writelines(buffer)
                buffer = []

        if '<?xml' in line:
            # new XML doc, close off the previous one
            if outfile is not None:
                outfile.close()
            outfile = None

        # line handling
        if outfile is None:
            buffer.append(line)
        else:
            outfile.write(line)

if outfile is not None:
    outfile.close()
# All lines processed, if there is a buffer left, then we have unhandled lines
if buffer:
    print('There were trailing lines without a name')
    print(''.join(buffer))  # same output as print(*buffer, sep=''), but also valid on Python 2.7
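The loop above assumes that every `<?xml` declaration starts its own line. As an alternative sketch for input where the cut can fall inside a line, the whole text can be split on the declaration and each chunk named from its own `<ns2:domain ... name="...">` tag. This assumes the file fits in memory; `split_documents`, `write_documents`, `NAME_RE` and the sample text are hypothetical names and data made up for illustration, and the regular expression is a tightened variant of the question's `namePat`, not the original.

```python
import os
import re

XML_DECL = '<?xml'
# Assumption: the domain name is the first name="..." attribute inside
# the opening <ns2:domain ...> tag of each document.
NAME_RE = re.compile(r'<ns2:domain\b[^>]*?\sname="([^"]+)"')


def split_documents(text):
    """Yield (domain_name, document_text); the name is None if not found."""
    # Anything before the first declaration is not a document, so skip it.
    for chunk in text.split(XML_DECL)[1:]:
        doc = XML_DECL + chunk
        match = NAME_RE.search(doc)
        yield (match.group(1) if match else None), doc


def write_documents(text, out_dir):
    """Write each named document to <out_dir>/<name>.part.xml.

    Returns the documents for which no name was found.
    """
    unnamed = []
    for name, doc in split_documents(text):
        if name is None:
            unnamed.append(doc)
            continue
        # The file is opened only once its name is known.
        with open(os.path.join(out_dir, name + '.part.xml'), 'w') as out:
            out.write(doc)
    return unnamed


# Made-up sample: the second declaration starts in the middle of a line.
sample = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    '<ns2:domain id="1"  name="addressing.com" ttl="300">\n'
    "</ns2:domain><?xml version='1.0' encoding='UTF-8'?>\n"
    '<ns2:domain id="2"  name="alicemcmahon.com" ttl="300">\n'
    "</ns2:domain>\n"
)
docs = list(split_documents(sample))
```

Because each chunk carries its own name, there is no shared state that can fall out of step with the data. The trade-off is memory use; for very large inputs the line-by-line buffer approach is the safer choice.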