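I'm trying to write a Python script to extract lines from a file. The file is a text file containing a dump of python suds output.
I would like to find each ArrayOf_xsd_string marker and extract the lines that follow it.
The data in the file is a list: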
[(ArrayOf_xsd_string){
item[] =
"001",
"ABCD",
"1234",
"wordy type stuff",
"123456",
"more stuff, etc",
}, (ArrayOf_xsd_string){
item[] =
"002",
"ABCD",
"1234",
"wordy type stuff",
"234567",
"more stuff, etc",
}, (ArrayOf_xsd_string){
item[] =
"003",
"ABCD",
"1234",
"wordy type stuff",
"345678",
"more stuff, etc",
}]
I tried using re.compile; here is my poor attempt at the code:
import re, string
f = open('data.txt', 'rb')
linelist = []
for line in f:
    line = re.compile('[\W_]+')
    line.sub('', string.printable)
    linelist.append(line)
print linelist
newlines = []
for line in linelist:
    mylines = line.split()
    if re.search(r'\w+', 'ArrayOf_xsd_string'):
        newlines.append([next(linelist) for _ in range(6)])
print newlines
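I'm a Python newbie and haven't found anything on Google or Stack Overflow about how to extract a specific number of lines after finding specific text. Any help is greatly appreciated.
Please ignore my code, as it is very much a "shot in the dark" :)
Here is the result I would like to see: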
123456: 001,ABCD,1234,wordy type stuff,more stuff etc
234567: 002,ABCD,1234,wordy type stuff,more stuff etc
345678: 003,ABCD,1234,wordy type stuff,more stuff etc
I hope this helps explain what my bad code is trying to do.
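Answer 0 (score: 2)
A few comments on your code:
Stripping all non-alphanumeric characters is completely unnecessary and a waste of time, and there is no need to build linelist at all. Are you aware that you can simply use a plain string.find("ArrayOf_xsd_string") or re.search(...)?
Then about your regex: _ counts as a word character, so \W on its own does not match it; the explicit _ in [\W_] is what makes the pattern strip underscores too. The bigger problem is that the assignment below overwrites the line you just read: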
for line in f:
    line = re.compile('[\W_]+') # overwrites the line you just read??
    line.sub('', string.printable)
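As a minimal sketch of the advice above (Python 3 syntax; the sample lines and the helper name find_marker_lines are made up for illustration), the marker can be located with a plain substring test while the loop variable keeps the text that was read:

```python
# Hypothetical sample lines mirroring the suds dump from the question.
sample_lines = [
    '[(ArrayOf_xsd_string){\n',
    'item[] =\n',
    '"001",\n',
    '}, (ArrayOf_xsd_string){\n',
]

MARKER = 'ArrayOf_xsd_string'


def find_marker_lines(lines, marker=MARKER):
    """Return the indexes of the lines that contain the marker text."""
    hits = []
    for lineno, line in enumerate(lines):
        # `line` still holds the text that was read; nothing overwrites it.
        if marker in line:  # same test as line.find(marker) >= 0
            hits.append(lineno)
    return hits


print(find_marker_lines(sample_lines))
```

No regex is needed for a fixed marker string; re.search only becomes useful once the marker itself is a pattern.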
Here is my version, which reads the file directly and also handles multiple matches:
with open('data.txt', 'r') as f:
    theDict = {}
    found = -1
    for (lineno, line) in enumerate(f):
        if found < 0:
            if line.find('ArrayOf_xsd_string') >= 0:
                found = lineno
                entries = []
            continue
        # Grab following 6 lines...
        if 2 <= (lineno - found) <= 6 + 1:
            # strip whitespace and the newline first, then quotes and punctuation
            entry = line.strip().strip(' "{}[]=:,')
            entries.append(entry)
        # then create a dict with the key from line 5
        if (lineno - found) == 6 + 1:
            key = entries.pop(4)
            theDict[key] = entries
            print('%s %s' % (key, ','.join(entries)))  # comma-separated, no quotes
            # break  # if you want to end on first match
            found = -1  # to process multiple matches
The output is what you wanted (that is what ','.join(entries) is for):
123456 001,ABCD,1234,wordy type stuff,more stuff, etc
234567 002,ABCD,1234,wordy type stuff,more stuff, etc
345678 003,ABCD,1234,wordy type stuff,more stuff, etc
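For comparison, here is a hedged sketch of a different technique (not the method used in the answer above): split the dump on the marker and pull out every double-quoted string with re.findall. It assumes Python 3 and that no value contains an embedded double quote; parse_suds_dump and format_records are hypothetical helper names, and the sample text is a shortened copy of the data in the question.

```python
import re

MARKER = 'ArrayOf_xsd_string'


def parse_suds_dump(text, key_index=4):
    """Map the value at key_index of each record to the remaining values."""
    records = {}
    # Everything before the first marker is dropped by the [1:] slice.
    for block in text.split(MARKER)[1:]:
        values = re.findall(r'"([^"]*)"', block)  # every double-quoted string
        if len(values) <= key_index:
            continue  # incomplete record, skip it
        key = values.pop(key_index)
        records[key] = values
    return records


def format_records(records):
    """Render each record as 'key: v1,v2,...' in insertion order."""
    return ['%s: %s' % (key, ','.join(values)) for key, values in records.items()]


sample = '''[(ArrayOf_xsd_string){
item[] =
"001",
"ABCD",
"1234",
"wordy type stuff",
"123456",
"more stuff, etc",
}, (ArrayOf_xsd_string){
item[] =
"002",
"ABCD",
"1234",
"wordy type stuff",
"234567",
"more stuff, etc",
}]'''

for line in format_records(parse_suds_dump(sample)):
    print(line)
```

This prints the key followed by a colon, as in the question's desired result. The comma inside "more stuff, etc" is kept, because only the surrounding quotes are removed.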
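Answer 1 (score: 1)
If you want to extract a specific number of lines after a particular matching line, you can simply read the file into an array with readlines, loop over it to find the match, and then take the next N lines from the array. Alternatively, you can use a while loop with readline, which is preferable if the file may grow large.
Below is the most straightforward fix to your code that I can think of. It is not necessarily the best overall implementation, so I suggest following the tips above unless you have a good reason not to, or you just want to get the job done by hook or by crook as quickly as possible ;)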
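# assumes linelist holds the raw lines of the file, e.g. linelist = f.readlines()
newlines = []
for i in range(len(linelist)):
    # test the current line, not a constant string
    if re.search(r'ArrayOf_xsd_string', linelist[i]):
        for l in linelist[i+2:i+20]:
            newlines.append(l)
print(newlines)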
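If I have interpreted your requirements correctly, this should do what you want. It says: take the line after next, then the following 17 lines (up to, but not including, the 20th line after the match), and append them to newlines. (You can't pass the whole slice to append in one call, because the slice would become a single element of the list you are adding to.)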
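The readline-based alternative mentioned earlier in this answer could be sketched roughly as follows. This is an illustration under assumptions, not tested against a real suds dump: it uses Python 3, the helper name lines_after_match and its skip/count parameters are made up, and io.StringIO stands in for open('data.txt') so the example runs on its own.

```python
import io


def lines_after_match(stream, marker, skip, count):
    """Collect `count` lines after each marker line, skipping `skip` lines first."""
    groups = []
    while True:
        line = stream.readline()
        if not line:  # readline() returns '' at end of file
            break
        if marker not in line:
            continue
        for _ in range(skip):  # e.g. the 'item[] =' line
            stream.readline()
        group = []
        for _ in range(count):
            nxt = stream.readline()
            if not nxt:
                break  # file ended in the middle of a record
            group.append(nxt.strip())
        groups.append(group)
    return groups


# Shortened stand-in for the real file: two records with two values each.
demo = io.StringIO(
    '[(ArrayOf_xsd_string){\n'
    'item[] =\n'
    '"001",\n'
    '"ABCD",\n'
    '}, (ArrayOf_xsd_string){\n'
    'item[] =\n'
    '"002",\n'
    '"EFGH",\n'
    '}]\n'
)
print(lines_after_match(demo, 'ArrayOf_xsd_string', skip=1, count=2))
```

Only one line is held in memory at a time, which is the reason to prefer this over readlines for large files. Note that count must not exceed the real record length, otherwise the next marker line is consumed as data.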
Have fun, and good luck :)
Answer 2 (score: 0)
Let's have some fun with iterators!
class SudsIterator(object):
    """extracts xsd strings from suds text file, and returns a
    (key, (value1, value2, ...)) tuple with key being the 5th field"""
    def __init__(self, filename):
        self.data_file = open(filename)
    def __enter__(self):    # __enter__ and __exit__ are there to support
        return self         # `with SudsIterator as blah` syntax
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.data_file.close()
    def __iter__(self):
        return self
    def next(self):         # in Python 3+ this should be __next__
        """looks for the next 'ArrayOf_xsd_string' item and returns it as a
        tuple fit for stuffing into a dict"""
        data = self.data_file
        for line in data:
            if 'ArrayOf_xsd_string' not in line:
                continue
            ignore = next(data)                 # skip the 'item[] =' line
            val1 = next(data).strip()[1:-2]     # discard beginning whitespace,
            val2 = next(data).strip()[1:-2]     # quotes, and comma
            val3 = next(data).strip()[1:-2]
            val4 = next(data).strip()[1:-2]
            key = next(data).strip()[1:-2]
            val5 = next(data).strip()[1:-2]
            break
        else:
            self.data_file.close()  # make sure file gets closed
            raise StopIteration()   # and keep raising StopIteration
        return key, (val1, val2, val3, val4, val5)
    __next__ = next         # alias so the class also iterates under Python 3

data = dict()
for key, value in SudsIterator('data.txt'):
    data[key] = value
print(data)
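The same iterator idea can also be written as a generator function, which takes care of the iterator protocol automatically. The following is a minimal sketch in Python 3, not part of the original answer; iter_suds_records and its key_index/fields parameters are hypothetical names, and it assumes every record has exactly six quoted values after the 'item[] =' line.

```python
def iter_suds_records(lines, key_index=4, fields=6):
    """Yield (key, values) for each 'ArrayOf_xsd_string' record in lines."""
    lines = iter(lines)
    for line in lines:
        if 'ArrayOf_xsd_string' not in line:
            continue
        next(lines, None)  # skip the 'item[] =' line
        values = []
        for _ in range(fields):
            raw = next(lines, None)
            if raw is None:
                return  # truncated record: stop quietly
            # drop surrounding whitespace, the trailing comma, then the quotes
            values.append(raw.strip().rstrip(',').strip('"'))
        key = values.pop(key_index)
        yield key, tuple(values)


# A list of lines stands in for the open file; with a real file this would be
#     with open('data.txt') as f:
#         data = dict(iter_suds_records(f))
sample = [
    '[(ArrayOf_xsd_string){\n',
    'item[] =\n',
    '"001",\n',
    '"ABCD",\n',
    '"1234",\n',
    '"wordy type stuff",\n',
    '"123456",\n',
    '"more stuff, etc",\n',
    '}]\n',
]
data = dict(iter_suds_records(sample))
print(data)
```

Because the generator accepts any iterable of lines, the caller keeps control of opening and closing the file with a normal with block.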