Question

I have a very large file whose contents look like this:

....
0.040027 a b c d e 12 34 56 78 90 12 34 56
0.050027 f g h i l 12 34 56 78 90 12 34 56
0.060027 a b c d e 12 34 56 78 90 12 34 56
0.070027 f g h i l 12 34 56 78 90 12 34 56
0.080027 a b c d e 12 34 56 78 90 12 34 56
0.090027 f g h i l 12 34 56 78 90 12 34 56
....

I need to build the dictionary shown below as quickly as possible.

I am using the following code:

import time

ascFile = open('C:\\eample.txt', 'r', encoding='UTF-8')

# Each tag is padded with spaces so it only matches whole words
tag1 = ' a b c d e '
tag2 = ' f g h i l '
tags = [tag1, tag2]

temp = {'k1':[], 'k2':[]}
key_tag = {'k1':tag1, 'k2':tag2 }

t1 = time.time()
for line in ascFile:
    for path, tag in key_tag.items():
        if tag in line:
            # Split once on the tag: timestamp on the left, payload on the right
            columns = line.strip().split(tag, 1)
            temp[path].append([columns[0], columns[-1].replace(' ', '')])
t2 = time.time()
print(t2-t1)

Parsing a 360 MB file this way takes about 6 seconds and gives the following result. I would like to reduce that time.

temp = {'k1': [['0.040027', '1234567890123456'], ['0.060027', '1234567890123456'], ['0.080027', '1234567890123456']], 'k2': [['0.050027', '1234567890123456'], ['0.070027', '1234567890123456'], ['0.090027', '1234567890123456']]}

Answer 1

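I am assuming the key is made up of a fixed number of words in the file. Use split to break up the string, then take a slice of the resulting list to compute the key directly: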

import collections

# raw strings don't need \\ for backslash:
FILESPEC = r'C:\example.txt'

lines_by_key = collections.defaultdict(list)

with open(FILESPEC, 'r', encoding='UTF-8') as f:
    for line in f:
        cols = line.split()
        key = ' '.join(cols[1:6])
        pair = (cols[0], ''.join(cols[6:]))  # tuple, not list, could be changed
        lines_by_key[key].append(pair)

print(lines_by_key)
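
As a quick check of the slicing idea, here is a minimal sketch that runs the same logic over a few in-memory sample lines instead of the real file. The helper name `group_lines` and the guard that skips lines with fewer than seven fields are my own assumptions for illustration, not something the original data format is known to require.

```python
import collections

def group_lines(lines):
    """Group (timestamp, payload) pairs by the five-word key in columns 1-5."""
    lines_by_key = collections.defaultdict(list)
    for line in lines:
        cols = line.split()
        if len(cols) < 7:
            # Assumption: blank or truncated lines are simply skipped
            continue
        key = ' '.join(cols[1:6])
        lines_by_key[key].append((cols[0], ''.join(cols[6:])))
    return lines_by_key

sample = [
    '0.040027 a b c d e 12 34 56 78 90 12 34 56\n',
    '\n',
    '0.050027 f g h i l 12 34 56 78 90 12 34 56\n',
    '0.060027 a b c d e 12 34 56 78 90 12 34 56\n',
]
result = group_lines(sample)
print(dict(result))
```

Note that the keys here are the tag words themselves ('a b c d e', 'f g h i l') rather than 'k1' and 'k2'. If you need those names, a small lookup dict applied afterwards should do.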

Answer 2

I used partition instead of split, so that the "in" test and the split happen in a single pass.

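for line in ascFile:
    # strip() drops the trailing newline, as in the question's code
    line = line.strip()
    for path, tag in key_tag.items():
        val0, tag_found, val1 = line.partition(tag)
        if tag_found:
            temp[path].append([val0, val1.replace(' ', '')])
            # A line holds at most one tag, so stop checking the others
            break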

Is this any faster on your 360 MB file?
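
To make the behaviour of `str.partition` concrete, here is a small sketch of what it returns when the tag is present and when it is not. The sample strings are shortened versions of the lines in the question.

```python
tag = ' a b c d e '
hit = '0.040027 a b c d e 12 34 56'
miss = '0.050027 f g h i l 12 34 56'

# Separator found: text before it, the separator itself, text after it
found = hit.partition(tag)

# Separator not found: the whole string followed by two empty strings
not_found = miss.partition(tag)

print(found)
print(not_found)
```

Because the middle element is an empty string when there is no match, testing it takes the place of a separate `tag in line` check.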

You might also run a simple test in which all you do is iterate over the file one line at a time:

for line in ascFile:
    pass

That will tell you the best time you could possibly achieve.
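
One way to compare that baseline against the real parsing loop is sketched below. It assumes a small throwaway file built with `tempfile` in place of the 360 MB file, and the helper names `timed`, `bare_loop` and `parse_loop` are hypothetical. The timings it prints depend entirely on your machine.

```python
import os
import tempfile
import time

def timed(func, path):
    """Run func(path) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start

def bare_loop(path):
    """Baseline: read every line and only count it."""
    count = 0
    with open(path, 'r', encoding='UTF-8') as f:
        for _ in f:
            count += 1
    return count

def parse_loop(path):
    """The partition-based parsing loop from this answer."""
    key_tag = {'k1': ' a b c d e ', 'k2': ' f g h i l '}
    temp = {key: [] for key in key_tag}
    with open(path, 'r', encoding='UTF-8') as f:
        for line in f:
            line = line.strip()
            for key, tag in key_tag.items():
                val0, tag_found, val1 = line.partition(tag)
                if tag_found:
                    temp[key].append([val0, val1.replace(' ', '')])
                    break
    return temp

# Assumption: a generated sample file stands in for the real data
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='UTF-8') as tmp:
    for _ in range(10000):
        tmp.write('0.040027 a b c d e 12 34 56 78 90 12 34 56\n')
        tmp.write('0.050027 f g h i l 12 34 56 78 90 12 34 56\n')
    sample_path = tmp.name

line_count, baseline = timed(bare_loop, sample_path)
parsed, parse_time = timed(parse_loop, sample_path)
os.remove(sample_path)

print(f'baseline: {baseline:.4f} s for {line_count} lines')
print(f'parsing:  {parse_time:.4f} s')
```

If the two numbers are already close, most of the time is going into reading the file, and further tuning of the parsing code is unlikely to help much.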

Fastest way to parse a text file

2 answers: