Question

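Like many unfortunate programmer souls before me, I'm dealing with an ancient file format that refuses to die. I'm talking about a format specification from around 1970. If it were entirely up to me, we would throw out the file format and every tool that knows how to handle it, and start from scratch. I can dream, but unfortunately that won't solve my problem.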

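The format: very loosely defined, because years of nonsensical revisions have destroyed almost all of the backward compatibility it once had. Basically, the only constant is that there are section headings, with almost no rules about what comes before or after those lines. The headings are sequential (e.g. HEADING1, HEADING2, HEADING3, ...), but they are not numbered and are not required (e.g. HEADING1, HEADING3, HEADING7). Thankfully, all possible permutations of the headings are known. Here is a fake example: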

# Bunch of comments

SHOES # First heading
# bunch text and numbers here

HATS # Second heading
# bunch of text here

SUNGLASSES # Third heading
...

My problem: I need to concatenate several of these files by these section headings. I have a Perl script that does this quite well:

while(my $l=<>) {

    if($l=~/^SHOES/i) { $r=\$shoes; name($r);}
    elsif($l=~/^HATS/i) { $r=\$hats; name($r);}
    elsif($l=~/^SUNGLASSES/i) { $r=\$sung; name($r);}
    elsif($l=~/^DRESS/i || $l=~/^SKIRT/i ) { $r=\$dress; name($r);}
    ...
    ...
    elsif($l=~/^END/i) { $r=\$end; name($r);}
    else {
        $$r .= $l;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}

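As you can see, the Perl script basically just changes where a reference points whenever a pattern matches, and appends each line of the file to the corresponding string until the next pattern match. The strings are then printed out as one big concatenated file.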

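I would have stuck with Perl, but my needs are getting more complex every day, and I'd really like to see how this can be solved elegantly in Python (can it be?). So far my approach in Python has basically been to load the whole file as a string, search for the heading positions, then split the string at those heading indices and concatenate the pieces. That takes a lot of regular expressions, if statements and variables for something that looks so simple in another language.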

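It seems to come down to a basic language issue. I found a very good discussion of Python's "call by object" style, as opposed to other languages that are call by reference: How do I pass a variable by reference? However, I still can't think of an elegant way to do this in Python. If anyone can help kick my brain in the right direction, it would be greatly appreciated.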

Answer 1

That's not even elegant Perl.

my @headers = qw( shoes hats sunglasses dress );

my $header_pat = join "|", map quotemeta, @headers;
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if    (/($header_re)/) { name( $section = \$sections{ lc $1 } ); }  # lc: the match is case-insensitive
    elsif (/skirt/i)       { name( $section = \$sections{'dress'} ); }
    else { $$section .= $_; }

    print STDERR "Finished processing $ARGV\n" if eof;
}

Or, if you have a lot of exceptions:

my @headers = qw( shoes hats sunglasses dress );
my %aliases = ( 'skirt' => 'dress' );

my $header_pat = join "|", map quotemeta, @headers, keys(%aliases);
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if (/($header_re)/) {
       name( $section = \$sections{ $aliases{lc $1} // lc $1 } );  # lc: the match is case-insensitive
    } else {
       $$section .= $_;
    }

    print STDERR "Finished processing $ARGV\n" if eof;
}

Using a hash saves the countless my declarations you didn't show.

You could also use $header_name = $1; name(\$sections{$header_name}); and $sections{$header_name} .= $_ for a bit more readability.

Answer 2

I'm not sure I understand your whole problem, but this seems to do everything you need:

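import sys

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = [[] for header in headers]

for arg in sys.argv[1:]:
    section_index = 0
    with open(arg) as f:
        for line in f:
            # Headings are ordered but optional, so check every later heading
            # (this also avoids running off the end of the headers list).
            for i in range(section_index + 1, len(headers)):
                if line.startswith(headers[i]):
                    section_index = i
                    break
            else:
                # Not a heading: the line belongs to the current section.
                sections[section_index].append(line)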

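Obviously you could change this to read or mmap the whole file, then use re.search or just buf.find to locate the next heading. Something like this (untested pseudocode):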

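import sys
from collections import defaultdict

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = defaultdict(list)

for arg in sys.argv[1:]:
    with open(arg) as f:
        # Leading newline so that a heading on the very first line is found too.
        buf = '\n' + f.read()
    section = None
    start = 0
    for header in headers[1:]:
        # A heading only counts at the start of a line.
        idx = buf.find('\n'+header, start)
        if idx != -1:
            sections[section].append(buf[start:idx])
            section = header
            # Skip the rest of the heading line.
            start = buf.find('\n', idx+1)
            if start == -1:
                break  # the heading was the last line, nothing follows it
    else:
        # No more headings: the rest of the file belongs to the last section.
        sections[section].append(buf[start:])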

There are plenty of other options.

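But the point is, I can't find anywhere in any of this where you would need to pass a variable by reference, so I'm not sure where you're getting tripped up with the approach you chose.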
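
To illustrate that point, here is a minimal sketch (not the asker's code). In Python, a name bound to a list already behaves much like the Perl reference $r: re-pointing the name at another list and appending through it changes the list stored in the dict. The function name split_sections, the HEADINGS tuple and the sample lines are assumptions made up for this example, and a heading is assumed to be the first word on its line.

```python
from collections import defaultdict

# Hypothetical heading names, borrowed from the question's fake example.
HEADINGS = ('SHOES', 'HATS', 'SUNGLASSES')

def split_sections(lines, sections=None):
    """Append each line to the list of the section it belongs to."""
    if sections is None:
        sections = defaultdict(list)
    current = sections[None]           # lines before the first heading
    for line in lines:
        words = line.split()
        first = words[0].upper() if words else ''
        if first in HEADINGS:
            current = sections[first]  # re-point the name, like $r = \$shoes
        else:
            current.append(line)       # mutates the list stored in sections
    return sections

# Two "files" merged into the same sections dict.
sections = split_sections(['# intro\n', 'SHOES # first\n', 'boots\n', 'HATS\n', 'cap\n'])
split_sections(['shoes\n', 'sandals\n'], sections)
print(dict(sections))
```

Passing the same sections dict to every call is what merges several files: each call only appends to lists that already exist.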

So, what if you want to treat two different headings as the same section?

Easy: create a dict that maps headings to sections. For example, for the second version:

headers_to_sections = {None: None, 'SHOES': 'SHOES', 'HATS': 'HATS',
                       'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}

Now, in the code that does sections[section], just do sections[headers_to_sections[section]].

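For the first version, just make the mapping go from strings to indices instead of from strings to strings, or replace sections with a dict. Or just flatten the two collections into a single collections.OrderedDict.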
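
As a rough sketch of the dict-based variant of the first version: one OrderedDict holds both the heading order and the heading-to-section mapping. The helper name collect and the sample lines are assumptions for illustration only, and the sketch looks a heading up by name rather than enforcing the heading order.

```python
from collections import OrderedDict

# Heading -> section it is stored under; two headings may share a section.
headers_to_sections = OrderedDict([
    ('SHOES', 'SHOES'), ('HATS', 'HATS'),
    ('DRESSES', 'DRESSES'), ('SKIRTS', 'DRESSES'),
])

def collect(lines, sections):
    """Append lines to sections[name]; None holds text before any heading."""
    section = None
    for line in lines:
        for header, name in headers_to_sections.items():
            if line.startswith(header):
                section = name
                break
        else:
            # Not a heading line: store it under the current section.
            sections.setdefault(section, []).append(line)
    return sections

sections = OrderedDict()
collect(['SHOES\n', 'boots\n', 'SKIRTS\n', 'mini\n'], sections)
collect(['DRESSES\n', 'gown\n'], sections)
print(sections['DRESSES'])   # ['mini\n', 'gown\n']
```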

Answer 3

Assuming you are reading from stdin, as in the Perl script, this should do it:

import sys
import collections
headings = {'SHOES':'SHOES','HATS':'HATS','DRESS':'DRESS','SKIRT':'DRESS'} # etc...
sections = collections.defaultdict(str)
key = None
for line in sys.stdin:
    sline = line.strip()
    if sline not in headings:
        # str has no append(), so concatenate; keep the line ending
        sections[headings.get(key)] += line
    else:
        key = sline

You end up with a dictionary like this:

{
    None: <all lines as a single string before any heading>,
    'HATS' : <all lines as a single string below HATS heading and before next heading>,
    etc...
}

Since the headings are picked up as they appear in the input, there is no need to define the headings in any particular order.
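
The question also wants the merged result printed as one file. A minimal sketch of that last step, assuming a dict shaped like the one above in which every value is a single string that keeps its line endings. The ORDER list and the write_merged name are hypothetical; the question only says that the possible heading orders are known.

```python
import io

# Hypothetical output order for the merged file.
ORDER = [None, 'SHOES', 'HATS', 'DRESS']

def write_merged(sections, out):
    """Write each non-empty section, preceded by its heading line."""
    for name in ORDER:
        body = sections.get(name, '')
        if not body:
            continue               # this heading never appeared
        if name is not None:
            out.write(name + '\n')
        out.write(body)

out = io.StringIO()
write_merged({'HATS': 'cap\n', None: '# notes\n', 'SHOES': 'boots\n'}, out)
print(out.getvalue(), end='')
```

In real use, sys.stdout or an open file would take the place of the StringIO object.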

Answer 4

My deepest sympathies!

Here is some code (please excuse minor syntax errors):

  def foundSectionHeader(l, secHdrs):
    # Return the heading contained in line l, or None if there is none.
    for s in secHdrs:
      if s in l:
        return s
    return None

  def main():
    fileList = ['file1.txt', 'file2.txt']  # ...
    sectionHeaders = ['SHOES', 'HATS']  # ...
    sectionContents = dict()
    for section in sectionHeaders:
      sectionContents[section] = []
    for file in fileList:
      with open(file) as fp:
        lines = fp.readlines()
      idx = 0
      while idx < len(lines):
        sec = foundSectionHeader(lines[idx], sectionHeaders)
        idx += 1  # always advance; lines before the first heading are skipped
        if sec:
          # Collect lines until the next heading or the end of the file.
          while idx < len(lines) and not foundSectionHeader(lines[idx], sectionHeaders):
            sectionContents[sec].append(lines[idx])
            idx += 1

This assumes that you have no content lines that look like "SHOES"/"HATS" etc.
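
If that assumption is too risky, one possible tightening is to accept a heading only at the start of a line, as the Perl patterns in the question do. This is a sketch under that assumption, not a description of the real format; SECTION_HEADERS, HEADER_RE and find_section_header are names invented for the example.

```python
import re

# Hypothetical heading list; extend it with the real headings.
SECTION_HEADERS = ['SHOES', 'HATS', 'SUNGLASSES']

# A heading must start the line and be followed by whitespace, '#' or the end.
HEADER_RE = re.compile(
    r'^(%s)(?=\s|#|$)' % '|'.join(map(re.escape, SECTION_HEADERS)),
    re.IGNORECASE,
)

def find_section_header(line):
    """Return the canonical heading that starts the line, or None."""
    m = HEADER_RE.match(line)
    return m.group(1).upper() if m else None

print(find_section_header('SHOES # First heading\n'))   # SHOES
print(find_section_header('red SHOES, size 9\n'))       # None
print(find_section_header('HATSTAND\n'))                # None
```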

How can I elegantly combine/concatenate files with Python?
