Question

I'm trying to write a Python script to extract lines from a file. The file is a text file containing a dump of Python suds output.

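I want to:

Strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc. characters.
Find the section that starts with "ArrayOf_xsd_string"
Remove the next line "item[] =" from the result
Grab the remaining 6 lines and create a dictionary based on the unique number on the fifth line (123456, 234567, 345678), using this number as the key and the remaining lines as the values (pardon my ignorance if I'm not explaining this in Pythonic terms)
Output the results to a file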

The data in the file is a list:

[(ArrayOf_xsd_string){
   item[] = 
      "001",
      "ABCD",
      "1234",
      "wordy type stuff",
      "123456",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "002",
      "ABCD",
      "1234",
      "wordy type stuff",
      "234567",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "003",
      "ABCD",
      "1234",
      "wordy type stuff",
      "345678",
      "more stuff, etc",
 }]

I tried using re.compile; here is my poor attempt at the code:

import re, string

f = open('data.txt', 'rb')
linelist = []
for line in f:
  line = re.compile('[\W_]+')
  line.sub('', string.printable)
  linelist.append(line)
  print linelist

newlines = []
for line in linelist:
    mylines = line.split()
    if re.search(r'\w+', 'ArrayOf_xsd_string'):
      newlines.append([next(linelist) for _ in range(6)])
      print newlines

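I'm a Python newbie and haven't found anything on Google or Stack Overflow about how to extract a specific number of lines after finding specific text. Any help is much appreciated.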

Please ignore my code, as I'm taking a "shot in the dark" :)

Here is the result I'd like to see:

123456: 001,ABCD,1234,wordy type stuff,more stuff etc
234567: 002,ABCD,1234,wordy type stuff,more stuff etc
345678: 003,ABCD,1234,wordy type stuff,more stuff etc

I hope this helps explain what my faulty code was trying to do.

Answer 1

A few comments on your code:

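Stripping all non-alphanumeric characters is completely unnecessary and a waste of time; there is no need to build linelist at all. Are you aware that you can simply use plain string.find("ArrayOf_xsd_string") or re.search(...)?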
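
To make that concrete, here is a minimal Python 3 sketch of three ways to test a line for the marker. The sample line is copied from the question's data; the variable names are only for illustration.

```python
import re

line = ' }, (ArrayOf_xsd_string){\n'   # sample line from the dump
marker = 'ArrayOf_xsd_string'

has_marker_in = marker in line                        # plain substring test
has_marker_find = line.find(marker) >= 0              # find() returns -1 when absent
has_marker_re = re.search(marker, line) is not None   # regex search, same result here

print(has_marker_in, has_marker_find, has_marker_re)  # True True True
```

For a fixed string like this one, the `in` test is usually the simplest choice; the regex only pays off if the marker can vary.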

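Strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc. characters.
Find the section that starts with "ArrayOf_xsd_string"
Remove the next line "item[] =" from the result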

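Then, as to your regex: note that _ is a word character, so \W on its own does not match it; [\W_]+ strips underscores as well. But the real problem is that the assignment below overwrites the line you just read: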

for line in f:
  line = re.compile('[\W_]+') # overwrites the line you just read??
  line.sub('', string.printable)
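
If you did want to strip characters this way, a minimal Python 3 sketch of the intended usage might look like the following: compile the pattern once under its own name, then apply it to each line. The names non_alnum and sample_lines are hypothetical, and the sample lines are taken from the question's data.

```python
import re

non_alnum = re.compile(r'[\W_]+')   # compiled once, under its own name

sample_lines = ['   item[] = \n', '      "001",\n', '      "wordy type stuff",\n']

# sub() is applied to each line; the loop variable is never overwritten
cleaned = [non_alnum.sub('', line) for line in sample_lines]

print(cleaned)   # ['item', '001', 'wordytypestuff']
```

Note that the spaces inside "wordy type stuff" disappear too, which is one more reason not to strip everything up front.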

Here is my version, which reads the file directly and handles multiple matches:

with open('data.txt', 'r') as f:
    theDict = {}
    found = -1
    for (lineno,line) in enumerate(f):
        if found < 0:
            if line.find('ArrayOf_xsd_string')>=0:
                found = lineno
                entries = []
            continue
        # Grab following 6 lines...
        if 2 <= (lineno-found) <= 6+1:
            # whitespace and newline are in the strip set so the trailing '",' goes too
            entry = line.strip(' \t\r\n"{}[]=:,')
            entries.append(entry)
        #then create a dict with the key from line 5
        if (lineno-found) == 6+1:
            key = entries.pop(4)
            theDict[key] = entries
            print('%s %s' % (key, ','.join(entries))) # comma-separated, no quotes
            #break # if you want to end on first match
            found = -1 # to process multiple matches

The output is pretty much exactly what you wanted (that's what the ','.join(entries) is for):

123456 001,ABCD,1234,wordy type stuff,more stuff, etc
234567 002,ABCD,1234,wordy type stuff,more stuff, etc
345678 003,ABCD,1234,wordy type stuff,more stuff, etc

Answer 2

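If you want to extract a specific number of lines after a particular matching line, you can also simply read the file into an array with readlines, loop over it to find the match, and then take the next N lines from the array. Alternatively, you can use a while loop with readline, which is preferable if the file may get large.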
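
As a rough Python 3 sketch of the readline variant: the function name lines_after_match and its parameters are made up for illustration, and io.StringIO stands in for a real file object so the example is self-contained.

```python
import io

def lines_after_match(stream, marker, skip, count):
    """For each line containing marker, skip `skip` lines and then
    collect the next `count` lines, reading one line at a time."""
    groups = []
    line = stream.readline()
    while line:                          # readline() returns '' at end of file
        if marker in line:
            for _ in range(skip):        # e.g. the "item[] =" line
                stream.readline()
            groups.append([stream.readline().strip() for _ in range(count)])
        line = stream.readline()
    return groups

sample = io.StringIO(
    '[(ArrayOf_xsd_string){\n'
    '   item[] = \n'
    '      "001",\n'
    '      "ABCD",\n'
    ' }]\n'
)
groups = lines_after_match(sample, 'ArrayOf_xsd_string', skip=1, count=2)
print(groups)   # [['"001",', '"ABCD",']]
```

With a real file you would pass the object returned by open('data.txt') instead of the StringIO, and use count=6 for the data in the question.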

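Below is the most direct fix to your code that I can think of, but it isn't necessarily the best overall implementation. I suggest following the tips above unless you have a good reason not to, or you just want to get the job done by hook or by crook as quickly as possible ;)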

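# assumes `import re` and that linelist holds the file's lines (e.g. f.readlines())
newlines = []
for i in range(len(linelist)):
    # search the current line itself, not the literal marker string
    if re.search('ArrayOf_xsd_string', linelist[i]):
        # skip the "item[] =" line, then collect the lines that follow
        for l in linelist[i+2:i+20]:
            newlines.append(l)
        print(newlines)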

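If I've interpreted your requirements correctly, this should do what you want. It says: take the line after next, and the 17 lines following it (up to but not including the 20th line after the match), and append them to newlines. (You can't append the whole list in one go, because with append that list would become a single element of the list you're adding it to.)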
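
To illustrate that last point, here is a small Python 3 sketch of the difference between append and extend; the list contents are sample values from the question.

```python
chunk = ['"001",', '"ABCD",']

appended = []
appended.append(chunk)    # the whole list becomes one element

extended = []
extended.extend(chunk)    # each item is added individually

print(appended)   # [['"001",', '"ABCD",']]
print(extended)   # ['"001",', '"ABCD",']
```

So newlines.extend(linelist[i+2:i+20]) would have the same effect as the inner for loop.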

Have fun and good luck :)

Answer 3

Let's have some fun with iterators!

class SudsIterator(object):
    """extracts xsd strings from suds text file, and returns a 
    (key, (value1, value2, ...)) tuple with key being the 5th field"""
    def __init__(self, filename):
        self.data_file = open(filename)
    def __enter__(self):  # __enter__ and __exit__ are there to support 
        return self       # `with SudsIterator as blah` syntax
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.data_file.close()
    def __iter__(self):
        return self
    def next(self):     # in Python 3+ this should be __next__
        """looks for the next 'ArrayOf_xsd_string' item and returns it as a
        tuple fit for stuffing into a dict"""
        data = self.data_file
        for line in data:
            if 'ArrayOf_xsd_string' not in line:
                continue
            ignore = next(data)
            val1 = next(data).strip()[1:-2] # discard beginning whitespace,
            val2 = next(data).strip()[1:-2] #   quotes, and comma
            val3 = next(data).strip()[1:-2]
            val4 = next(data).strip()[1:-2]
            key = next(data).strip()[1:-2]
            val5 = next(data).strip()[1:-2]
            break
        else:
            self.data_file.close() # make sure file gets closed
            raise StopIteration()  # and keep raising StopIteration
        return key, (val1, val2, val3, val4, val5)
    __next__ = next     # alias so the class also iterates under Python 3

data = dict()
for key, value in SudsIterator('data.txt'):
    data[key] = value

print(data)
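
For comparison, the same idea can be sketched as a Python 3 generator function instead of a class. This is only an illustration: the name suds_records and its parameters are hypothetical, and io.StringIO stands in for the real file.

```python
import io
from itertools import islice

def suds_records(lines, marker='ArrayOf_xsd_string', key_index=4, fields=6):
    """Yield (key, values) pairs, using the field at key_index as the key."""
    lines = iter(lines)
    for line in lines:
        if marker not in line:
            continue
        next(lines, None)                      # skip the "item[] =" line
        # drop surrounding whitespace, the trailing comma, and the quotes
        values = [raw.strip().rstrip(',').strip('"')
                  for raw in islice(lines, fields)]
        if len(values) < fields:               # truncated record at end of input
            return
        key = values.pop(key_index)
        yield key, tuple(values)

sample = io.StringIO('''[(ArrayOf_xsd_string){
   item[] = 
      "001",
      "ABCD",
      "1234",
      "wordy type stuff",
      "123456",
      "more stuff, etc",
 }]
''')
data = dict(suds_records(sample))
print(data)
```

With the real file, `with open('data.txt') as f: data = dict(suds_records(f))` takes care of closing the file.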

Extract specific lines from a file and create sections of data in Python