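I have a txt file containing thousands of lines, each a string. Every line begins with a marker of the form '#integer', for example '#100'.
I read the txt file sequentially (line 1, line 2, line 3, ...) and extract the specific arrays I want, where each array is a set of line numbers together with the other lines connected to those lines.
The arrays take the following form: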
[ ['#355', '#354', '#357', '#356'], ['#10043', '#10047', '#10045'], ['#1221', '#1220', '#1223', '#1222', '#1224'], [...] ]
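It can contain hundreds of numbers. (This is because I have a series of numbers, and each subarray holds the further children related to them.)
I read my txt file before calling the function below: first I read the txt file and extract the numbers, then I pass them as an array to the extended_strings function, which replaces each number with the actual string of that number's line in the txt file.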
import re

def extended_strings(matrix, base_txt):
    string_matrix = matrix  # will hold our future strings (note: this aliases matrix, it is not a copy)
    for numset in string_matrix:
        for num in numset:
            for line in base_txt:
                results = re.findall(r'^#\d+', line)  # find the line # at start of string
                if len(results) > 0 and results[0] == num:  # if we have a # line that matches our # in the numset
                    index = numset.index(num)  # find index of line # in the numset
                    numset[index] = line  # if we match line #'s, we replace the line # with the actual string from the txt
    return string_matrix
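I am trying to make this process shorter and more efficient. For example, with 150,000 strings in the txt, the line "for line in base_txt" ends up scanning the txt file millions of times.
Any suggestions?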
Answer 0 (score: 1)
I have not done any measurements, but I believe this may help. On the other hand, there is still plenty of room for improvement.
text.txt:
#1 This is line #00001
#2 This is line #00002
#30 This is line #00030
#35 This is line #00035
#77 This is line #00077
#101 This is line #00101
#145 This is line #00145
#1010 This is line #01010
#8888 This is line #08888
#13331 This is line #13331
#65422 This is line #65422
Code:
import re

# reo = re.compile(r'^(#\d+)\s+(.*)\n?$')  # exclude line numbers in "string_matrix"
reo = re.compile(r'^((#\d+)\s+.*)\n?$')  # include line numbers in "string_matrix"

def file_to_dict(file_name):
    # Read the file once and map each '#<number>' to its line.
    file_dict = {}
    with open(file_name) as f:
        for line in f:
            mo = reo.fullmatch(line)
            if mo is None:
                continue  # skip lines that do not start with '#<number>'
            # file_dict[mo.group(1)] = mo.group(2)  # exclude line numbers in "string_matrix"
            file_dict[mo.group(2)] = mo.group(1)  # include line numbers in "string_matrix"
    return file_dict

def extended_strings(matrix, file_dict):
    # Build a new matrix; each number is resolved by one dictionary lookup.
    string_matrix = []
    for numset in matrix:
        new_numset = []
        for num in numset:
            new_numset.append(file_dict[num])
        string_matrix.append(new_numset)
    return string_matrix

matrix = [['#1010', '#35', '#2', '#145', '#8888'], ['#30', '#2'], ['#65422', '#1', '#13331', '#77', '#101', '#8888']]
file_dict = file_to_dict('text.txt')
string_matrix = extended_strings(matrix, file_dict)
for list_ in string_matrix:
    for line in list_:
        print(line)
    print()
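Since no measurements were taken, here is one rough way to check the claim. It is only a sketch under stated assumptions: the 2,000 synthetic lines and the query matrix are made up for illustration, scan_lookup is a simplified stand-in for the question's nested loop (it stops at the first match, which the original does not), and dict_lookup mirrors the dictionary idea used in this answer. Absolute timings will differ from machine to machine.

```python
import re
import timeit

LINE_NO = re.compile(r"^#\d+")

# Synthetic stand-in for the txt file (an assumption, not the real data).
base_txt = ["#%d This is line %d\n" % (n, n) for n in range(1, 2001)]
# Made-up query matrix: 50 groups of 4 line numbers spread over the file.
matrix = [["#%d" % ((g * 37 + k * 501) % 2000 + 1) for k in range(4)]
          for g in range(50)]


def scan_lookup(matrix, lines):
    """Rescan the lines for every number, as the question's loop does."""
    result = []
    for numset in matrix:
        new_numset = []
        for num in numset:
            for line in lines:
                m = LINE_NO.match(line)
                if m and m.group(0) == num:
                    new_numset.append(line)
                    break  # simplification: stop at the first match
        result.append(new_numset)
    return result


def dict_lookup(matrix, lines):
    """Index the lines once, then resolve each number with a dict lookup."""
    index = {}
    for line in lines:
        m = LINE_NO.match(line)
        if m:
            index[m.group(0)] = line
    return [[index[num] for num in numset] for numset in matrix]


# Time three runs of each approach on the same input.
scan_time = timeit.timeit(lambda: scan_lookup(matrix, base_txt), number=3)
dict_time = timeit.timeit(lambda: dict_lookup(matrix, base_txt), number=3)
print("scan: %.4f s, dict: %.4f s" % (scan_time, dict_time))
```

The scan does work proportional to (numbers looked up) x (lines in the file), while the dictionary version reads the file once and then pays roughly constant time per lookup, so the gap should widen as the file grows.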
Answer 1 (score: 0)
Thanks to Werner Wenzel for the help. I found a solution that works for me and would like to share it here:
import re

def file_to_dict(file_name):
    file_dict = {}
    with open(file_name) as f:
        for line in f:
            stg = re.findall(r"(.+)", line)  # the whole line without the trailing newline
            stgNum = re.findall(r"#\d{1,10}", line)  # every '#<number>' in the line; the first is the line marker
            file_dict[stgNum[0]] = stg[0]
    return file_dict

def extended_strings(matrix, file_dict):
    string_matrix = []
    for numset in matrix:
        new_numset = []
        for num in numset:
            new_numset.append(file_dict[num])
        string_matrix.append(new_numset)
    return string_matrix

matrix = [['#1010', '#35', '#2', '#145', '#8888'], ['#30', '#2'], ['#65422', '#1', '#13331', '#77', '#101', '#8888']]
file_dict = file_to_dict('text.txt')
string_matrix = extended_strings(matrix, file_dict)
for list_ in string_matrix:
    for line in list_:
        print(line)
print("done")
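The code above raises an IndexError on a line without a '#<number>' and a KeyError when a requested number is not in the file. If that can happen in your data, a more tolerant variant might look like the sketch below. The names build_index and extended_strings_safe and the missing parameter are hypothetical, introduced only for this illustration, and the sketch assumes the marker is followed by whitespace or the end of the line. It also avoids regular expressions by splitting off the first token.

```python
def build_index(lines):
    """Map '#<number>' to its full line (newline stripped).

    Lines that do not start with '#<number>' are skipped instead of raising.
    """
    index = {}
    for line in lines:
        text = line.rstrip("\n")
        # First whitespace-separated token, or "" for a blank line.
        key = text.split(None, 1)[0] if text.strip() else ""
        if key.startswith("#") and key[1:].isdigit():
            index[key] = text
    return index


def extended_strings_safe(matrix, index, missing=None):
    """Replace each number with its line; unknown numbers become `missing`."""
    return [[index.get(num, missing) for num in numset] for numset in matrix]


# Small in-memory example instead of a real file (made-up lines).
lines = ["#1 first\n", "not a numbered line\n", "\n", "#22 second"]
index = build_index(lines)
result = extended_strings_safe([["#22", "#1"], ["#999"]], index)
print(result)  # [['#22 second', '#1 first'], [None]]
```

Whether to skip bad input silently or fail loudly is a design choice. If a missing number indicates a real problem in the data, keeping the plain file_dict[num] lookup and letting the KeyError surface may be the better option.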