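I am trying to load large text data into a NumPy array. NumPy's loadtxt and genfromtxt do not work here because, first, comment lines starting with any of ['#','!','C'] have to be removed and, second, the data contains a repeat pattern of the form n*value, where n is an integer repeat count and value is the float data. So I tried reading the text file with readlines() and then using NumPy's loadtxt to convert the data into a NumPy array.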
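For the reading and replacing step I tried regular expressions (the re module) but could not get them to work. The Python code below does work, however. My question is: what is the most efficient and Pythonic way to do this? If it is a regex, what is the correct regular expression for the find-and-replace over the readlines() list object?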
lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
# Caution: deleting from a list while iterating over it skips the element after each deletion
for l, line in enumerate(lines):
    if line.strip() == '' or line.strip()[0] in ['#','!','C']:
        del lines[l]
for l, line in enumerate(lines):
    repls = [word for word in line.strip().split() if word.find('*')>=0]
    print(repls)
    for repl in repls:
        print(repl)
        line = line.replace(repl, ' '.join([repl.split('*')[1] for n in range(int(repl.split('*')[0]))]))
    lines[l] = line
print(lines)
The output is:
['1 2 2.5 2.5 2.5 3 6 .3 8 \n', '1 2.0 2.1 2.1 3 6 0 8 \n']
After the comments, I edited my Python code as follows:
in_lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
lines = []
for line in in_lines:
    if line.strip() == '' or line.strip()[0] in ['#','!','C']:
        continue
    else:
        repls = [word for word in line.strip().split() if word.find('*')>=0]
        for repl in repls:
            # str.join() only accepts strings, so the value stays as text here instead of going through float()
            line = line.replace(repl, ' '.join([repl.split('*')[1] for n in range(int(repl.split('*')[0]))]))
        lines.append(line)
print(lines)
Answer 0 (score: 1)
Using Python's functional features and list comprehensions:
#!/usr/bin/env python
lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
# filter out comments
lines = [line for line in lines if line.strip() != '' and line.strip()[0] not in ['#','!','C']]
# turns lines into lists of tokens
lines = [[word for word in line.strip().split()] for line in lines]
# turns a list of strings into a number generator, parsing '*' properly
def generate_numbers(tokens):
    for token in tokens:
        if '*' in token:
            n,m = token.split("*")
            for i in range(int(n)):
                yield float(m)
        else:
            yield float(token)
# use the generator to clean up the lines
lines = [list(generate_numbers(tokens)) for tokens in lines]
print(lines)
Output:
➤ ./try.py
[[1.0, 2.0, 2.5, 2.5, 2.5, 3.0, 6.0, 0.3, 8.0], [1.0, 2.0, 2.1, 2.1, 3.0, 6.0, 0.0, 8.0]]
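The question also asked what a regex version would look like. Here is a minimal sketch in Python 3, assuming every repeat token has the form integer*value with no spaces around the `*`. The names `REPEAT`, `expand`, and `clean_lines` are illustrative and not from the original post.

```python
import re

# Matches a whole token such as '3*2.5' or '1*.3':
# a repeat count at the start of a token, a '*', then the value.
REPEAT = re.compile(r'(?<!\S)(\d+)\*(\S+)')

def expand(match):
    # Replace 'n*value' with the value written out n times.
    count, value = int(match.group(1)), match.group(2)
    return ' '.join([value] * count)

def clean_lines(lines, comment_chars='#!C'):
    # Lazily skip blank and comment lines, expanding repeat tokens in the rest.
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped[0] in comment_chars:
            continue
        yield REPEAT.sub(expand, line)

in_lines = ['1 2 3*2.5 3 6 1*.3 8 \n', '! comment here\n', '1*1 2.0 2*2.1 3 6 0 8 \n']
cleaned = list(clean_lines(in_lines))
print(cleaned)
```

Since numpy.loadtxt accepts a list or generator of strings as well as a file name, the cleaned lines could be handed to it directly, provided every expanded row ends up with the same number of columns (the two sample rows here do not).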
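The next version uses generators instead of lists, so you do not have to load the whole file into memory. Note the use of two idioms:
with open("name") as file
This cleans up the file handle once the block is exited.
for line in file
This iterates over the lines of the file with a generator, without loading the whole file into memory.
This gives us: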
#!/usr/bin/env python
# turns a list of strings into a number generator, parsing '*' properly
def generate_numbers(tokens):
    for token in tokens:
        if '*' in token:
            n,m = token.split("*")
            for i in range(int(n)):
                yield float(m)
        else:
            yield float(token)
# Pull this out to make the code more readable
def not_comment(line):
    return line.strip() != '' and line.strip()[0] not in ['#','!','C']
with open("try.dat") as file:
    lines = (
        list(generate_numbers((word for word in line.strip().split())))
        for line in file if not_comment(line)
    ) # lines is a lazy generator
    for line in lines:
        print(line)
Output:
➤ ./try.py
[1.0, 2.0, 2.5, 2.5, 2.5, 3.0, 6.0, 0.3, 8.0]
[1.0, 2.0, 2.1, 2.1, 3.0, 6.0, 0.0, 8.0]
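To try the lazy pipeline without creating try.dat, an in-memory io.StringIO object can stand in for the open file, since it is also iterated line by line. This is a sketch in Python 3 under that assumption. `parse` and `sample` are illustrative names, and itertools.repeat is swapped in for the inner counting loop.

```python
import io
from itertools import repeat

def generate_numbers(tokens):
    # Expand 'n*value' tokens into n floats; plain tokens become a single float.
    for token in tokens:
        if '*' in token:
            n, value = token.split('*')
            for number in repeat(float(value), int(n)):
                yield number
        else:
            yield float(token)

def not_comment(line):
    # True for lines that hold data (not blank, not starting with '#', '!' or 'C').
    stripped = line.strip()
    return stripped != '' and stripped[0] not in '#!C'

def parse(lines):
    # Lazily yield one list of floats per data line.
    return (list(generate_numbers(line.split())) for line in lines if not_comment(line))

# Stand-in for the contents of try.dat (assumed sample data from the question).
sample = io.StringIO('1 2 3*2.5 3 6 1*.3 8 \n! comment here\n\n1*1 2.0 2*2.1 3 6 0 8 \n')
rows = list(parse(sample))
for row in rows:
    print(row)
```

Because `parse` only needs an iterable of lines, the same function works on a real file handle opened with `with open(...)`.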