I have a file with one word per line followed by thousands of floats, and I want to turn it into a dictionary mapping each word to the vector of all its floats. This is how I do it now, but given the size of the files (about 20k lines per file, with roughly 10k values per line) the process takes quite a while. I have not found a more efficient way to parse it, only alternatives that are not guaranteed to reduce the runtime.
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
google_words[line.split()[0]] = [float(line.split()[i]) for i in range(1, len(line.split()))]
Answer 0 (score: 5)
In your solution, the slow line.split() is executed repeatedly for every line: once for the word, once to compute the range bound, and once more for every single value inside the list comprehension. Consider the following modification:
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
word, *numbers = line.split()
google_words[word] = [float(number) for number in numbers]
One slightly advanced concept I used here is "unpacking":
word, *numbers = line.split()
Python allows unpacking the values of an iterable into multiple variables:
a, b, c = [1, 2, 3]
# This is practically equivalent to
a = 1
b = 2
c = 3
The * is a shortcut for "put the leftover items into a list and assign that list to a name":
a, *rest = [1, 2, 3, 4]
# results in
a == 1
rest == [2, 3, 4]
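As a small aside, the starred name does not have to come last; extended unpacking allows it in any single position, and the starred target is always a list:

first, *middle, last = [1, 2, 3, 4, 5]
# first == 1, middle == [2, 3, 4], last == 5

head, *tail = [1]
# head == 1, tail == []  (the starred target is a list even when empty)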
Answer 1 (score: 3)
Please don't call line.split() multiple times.
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
temp = line.split()
google_words[temp[0]] = [float(temp[i]) for i in range(1, len(temp))]
Here is a simple way to generate a test line like the ones in such a file:
s = "x"
for i in range (10000):
s += " 1.2345"
print (s)
Parsing that line with the original version takes noticeable time; the version with only a single split call is nearly instantaneous.
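If you want to reproduce the comparison yourself, here is a minimal benchmark sketch (the helper function names and the timeit usage are my own additions; the test string matches the generated line above):

import timeit

# One synthetic line: a word followed by 10,000 floats.
s = "x" + " 1.2345" * 10000

def parse_repeated_split(line):
    # Original approach: line.split() is re-evaluated on every access.
    return {line.split()[0]: [float(line.split()[i])
                              for i in range(1, len(line.split()))]}

def parse_single_split(line):
    # Improved approach: split once, then index the result.
    temp = line.split()
    return {temp[0]: [float(x) for x in temp[1:]]}

print(timeit.timeit(lambda: parse_repeated_split(s), number=1))
print(timeit.timeit(lambda: parse_single_split(s), number=1))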
Answer 2 (score: 1)
You could also use the csv module, which should be more efficient than what you are doing now. It would look something like this:
import csv

d = {}
with open("huge_file_so_huge.txt", "r") as g_file:
    for row in csv.reader(g_file, delimiter=" "):
        d[row[0]] = list(map(float, row[1:]))
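One caveat worth noting with this approach: csv.reader with delimiter=" " treats every single space as a separator, whereas str.split() with no arguments collapses runs of whitespace. A quick sketch of the difference (the sample line is made up):

import csv
import io

line = "word  1.0 2.0"  # note the double space after "word"

print(next(csv.reader(io.StringIO(line), delimiter=" ")))
# ['word', '', '1.0', '2.0']  -- the run of spaces produces an empty field

print(line.split())
# ['word', '1.0', '2.0']  -- whitespace runs are collapsed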