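I have to read a file line by line, where each line lists the indices of the 1s in a vector.
So for example: 1 3 9 10
means: 0,1,0,1,0,0,0,0,0,1,1
My goal is to write a program that takes each line and prints out the full vector, with the 0s filled in.
My current program handles a few lines fine: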
#create a sparse vector
list_line_sparse = [0] * int(num_features)
#loop over all the lines
for item in lines:
    #split the line on spaces
    zz = item.split(' ')
    #get all ints on a line
    d = [int(x.strip()) for x in zz]
    #loop over all ints and change index to 1 in sparse vector
    for i in d:
        list_line_sparse[i]=1
    out_file += (', '.join(str(item) for item in list_line_sparse))
    #change back to 0's
    for i in d:
        list_line_sparse[i]=0
    out_file +='\n'
f = open('outfile', 'w')
f.write(out_file)
f.close()
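The problem is that for files with a large number of features and lines, my program is very inefficient; it basically never finishes. Is there anything obvious I should change to make it more efficient? (e.g. the two for loops)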
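Answer 0 (score: 1)
It would probably be more efficient to write each line of data to the output file as you generate it, rather than building up one huge string in memory.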
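As a rough illustration of writing each row as soon as it is built, here is a minimal sketch. The function name `write_dense_rows` and its parameters are made up for this example, and it uses `,` as the separator to match the sample output in the question.

```python
import io

def write_dense_rows(lines, sink, num_features):
    """Write one comma-separated 0/1 row to sink per line of indices."""
    for line in lines:
        row = ["0"] * num_features
        for token in line.split():
            row[int(token)] = "1"
        # Write the row immediately; nothing accumulates across lines.
        sink.write(",".join(row) + "\n")

# An in-memory file stands in for a real output file here.
demo = io.StringIO()
write_dense_rows(["1 3 9 10", "2 7 3 5"], demo, 12)
print(demo.getvalue(), end="")
```

With a real file, `sink` would be the object returned by `open(..., 'w')`; the file object's own buffering then decides when data actually reaches the disk.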
numpy is a popular Python module that is well suited to bulk operations on numbers. If you start with:
import numpy as np
list_line_sparse = np.zeros(num_features, dtype=np.uint8)
Then, with d being the list of numbers on the current line, you can do the following:
list_line_sparse[d] = 1
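This sets all of those indices in the array at once, with no loop needed. (At least not at the Python level; there is obviously still a loop, but it is pushed down into numpy's C implementation.)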
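Putting the numpy pieces together, a per-line loop might look like the sketch below. The generator name `dense_rows` is an assumption for illustration; the set-then-reset pattern mirrors the question's code rather than anything prescribed by numpy.

```python
import numpy as np

def dense_rows(lines, num_features):
    """Yield one 0/1 array per input line of space-separated indices."""
    row = np.zeros(num_features, dtype=np.uint8)
    for line in lines:
        d = [int(token) for token in line.split()]
        row[d] = 1        # set every listed index in one operation
        yield row.copy()  # hand out a copy so the reset below is safe
        row[d] = 0        # clear only the touched positions for reuse

for row in dense_rows(["1 3 9 10"], 11):
    print(",".join(map(str, row.tolist())))
```

Resetting only the touched positions avoids allocating a fresh array per line, which may matter when `num_features` is large and each line has few indices.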
Answer 1 (score: 0)
It is slowing down because you are doing string concatenation. It would be better to use a list.
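A minimal sketch of the list-based approach, assuming the whole output still fits in memory: collect the finished rows in a list and join them once at the end. The name `build_output` is hypothetical.

```python
def build_output(lines, num_features):
    """Return the full output text, joining all rows in a single pass."""
    parts = []
    for line in lines:
        row = ["0"] * num_features
        for token in line.split():
            row[int(token)] = "1"
        parts.append(",".join(row) + "\n")
    # One join at the end instead of repeated `+=` on a growing string.
    return "".join(parts)

print(build_output(["1 3 9 10", "1 3 9 11"], 12), end="")
```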
Also, you could use csv to read your space-delimited lines, and then write each row with the commas added automatically:
import csv
num_features = 20
with open('input.txt', 'r', newline='') as f_input, open('output.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter=' ')
    csv_output = csv.writer(f_output)
    for row in csv_input:
        list_line_sparse = [0] * int(num_features)
        for v in map(int, row):
            list_line_sparse[v] = 1
        csv_output.writerow(list_line_sparse)
So if input.txt contains the following:
1 3 9 10
1 3 9 11
2 7 3 5
you get an output.txt containing:
0,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0
0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Answer 2 (score: 0)
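Too many loops: first item.split(), then for x in zz, then for i in d, then for item in list_line_sparse, then for i in d again. String concatenation is probably your most expensive part: .join and out_file +=. And all of this happens for every line.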
You could try parsing and writing "character by character". Something like this:
#features per line
count = int(num_features)
f = open('outfile.txt', 'w')
#loop over all lines
#(note: this assumes the indices on each line are in ascending order)
for item in lines:
    #reset the feature
    i = 0
    #the characters buffer
    index = ""
    #parse character by character
    for character in item:
        #if a space or end of line is found,
        #and the characters buffer (index) is not empty
        if character in (" ", "\r", "\n"):
            if index:
                #parse the characters buffer
                index = int(index)
                #if this is not the first feature
                if i > 0:
                    #add the separator
                    f.write(", ")
                #add 0's until index
                while i < index:
                    f.write("0, ")
                    i += 1
                #and write 1
                f.write("1")
                i += 1
                #reset the characters buffer
                index = ""
        #if it is not a space or end of line
        else:
            #add the character to the buffer
            index += character
    #if the last line didn't end with a newline,
    #index could be waiting to be parsed
    if index:
        index = int(index)
        if i > 0:
            f.write(", ")
        while i < index:
            f.write("0, ")
            i += 1
        f.write("1")
        i += 1
        index = ""
    #fill with 0's
    while i < count:
        if i == 0:
            f.write("0")
        else:
            f.write(", 0")
        i += 1
    f.write("\n")
f.close()
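The character-level code above writes positions strictly left to right, so it relies on each line's indices being in ascending order (a line such as `2 7 3 5` would come out wrong). A different way to avoid per-element string work, not taken from this answer, is to exploit the fixed width of every output row: keep one reusable byte template and flip only the bytes that change. The sketch below is an assumption-laden illustration; the name `write_rows_from_template` is invented, the sink must be opened in binary mode, and the separator is `,` rather than `, `.

```python
import io

def write_rows_from_template(lines, sink, num_features):
    """Write 0/1 rows to a binary sink using a reusable byte template."""
    # Template is b"0,0,...,0\n"; feature k sits at byte offset 2 * k.
    template = bytearray(b",".join([b"0"] * num_features) + b"\n")
    for line in lines:
        offsets = [2 * int(token) for token in line.split()]
        for off in offsets:
            template[off] = ord("1")
        sink.write(template)
        # Restore the touched bytes so the template is all zeros again.
        for off in offsets:
            template[off] = ord("0")

demo = io.BytesIO()
write_rows_from_template(["2 7 3 5", "1 3 9 10"], demo, 12)
print(demo.getvalue().decode(), end="")
```

Because each index maps directly to a byte offset, the order of indices within a line does not matter here.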
Answer 3 (score: 0)
Let's rework your code into a simpler package that takes better advantage of Python's features:
import sys
NUM_FEATURES = 12
with open(sys.argv[1]) as source, open(sys.argv[2], 'w') as sink:
    for line in source:
        list_line_sparse = [0] * NUM_FEATURES
        indices = map(int, line.rstrip().split())
        for index in indices:
            list_line_sparse[index] = 1
        print(*list_line_sparse, file=sink, sep=',')
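I revisited this problem with "more efficient" in mind. While the code above is more memory efficient, it is slower. I reconsidered your original and came up with a solution that is less memory efficient but about 2x faster than your code: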
import sys
NUM_FEATURES = 12
data = ''
with open(sys.argv[1]) as source:
    for line in source:
        list_line_sparse = ["0"] * NUM_FEATURES
        indices = map(int, line.rstrip().split())
        for index in indices:
            list_line_sparse[index] = "1"
        data += ",".join(list_line_sparse) + '\n'
with open(sys.argv[2], 'w') as sink:
    sink.write(data)
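Like your original solution, this stores all of the data in memory and writes it out at the end, which is both a drawback (memory-wise) and an advantage (time-wise).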
input.txt
1 3 9 10
1 3 9 11
2 7 3 5
USAGE
% python3 test.py input.txt output.txt
output.txt
0,1,0,1,0,0,0,0,0,1,1,0
0,1,0,1,0,0,0,0,0,1,0,1
0,0,1,1,0,1,0,1,0,0,0,0
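The memory-versus-time trade-off described in this answer also has a middle ground: buffer a bounded number of rows and flush them in batches. The sketch below is only one possible way to do that; `convert_in_chunks` and the `chunk_rows` parameter are names invented for this example, and whether it is faster than either version above would need to be measured.

```python
import io

def convert_in_chunks(source, sink, num_features, chunk_rows=10000):
    """Convert index lines to 0/1 rows, writing chunk_rows rows at a time."""
    buffer = []
    for line in source:
        row = ["0"] * num_features
        for token in line.split():
            row[int(token)] = "1"
        buffer.append(",".join(row))
        if len(buffer) >= chunk_rows:
            sink.write("\n".join(buffer) + "\n")
            buffer.clear()
    # Flush whatever is left after the last full chunk.
    if buffer:
        sink.write("\n".join(buffer) + "\n")

demo = io.StringIO()
convert_in_chunks(["1 3 9 10", "1 3 9 11", "2 7 3 5"], demo, 12, chunk_rows=2)
print(demo.getvalue(), end="")
```

Memory use is then bounded by roughly `chunk_rows` rows instead of the whole file.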