I have a file with one word per line followed by thousands of floats, and I want to turn it into a dictionary mapping each word to the vector of all its floats. This is how I do it now, but given the size of the files (about 20k lines per file, with roughly 10k values per line) the process takes quite a while. I have not found a more efficient way to parse it, only alternatives that are not guaranteed to reduce the runtime.
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
google_words[line.split()[0]] = [float(line.split()[i]) for i in range(1, len(line.split()))]
Answer 0 (score: 5)
In your solution, the slow line.split() is executed repeatedly for every line: once for the word, once to compute the range bound, and once more for every single value inside the list comprehension. Consider the following modification:
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
word, *numbers = line.split()
google_words[word] = [float(number) for number in numbers]
One slightly advanced concept I used here is "unpacking":
word, *numbers = line.split()
Python allows unpacking the values of an iterable into multiple variables:
a, b, c = [1, 2, 3]
# This is practically equivalent to
a = 1
b = 2
c = 3
The * is a shortcut for "put the leftover items into a list and assign that list to a name":
a, *rest = [1, 2, 3, 4]
# results in
a == 1
rest == [2, 3, 4]
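As a small aside, the starred name does not have to come last; extended unpacking allows it in any single position, and the starred target is always a list:

first, *middle, last = [1, 2, 3, 4, 5]
# first == 1, middle == [2, 3, 4], last == 5

head, *tail = [1]
# head == 1, tail == []  (the starred target is a list even when empty)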
Answer 1 (score: 3)
Please don't call line.split() multiple times.
with open("googlenews.word2vec.300d.txt") as g_file:
i = 0;
#dict of words: [lots of floats]
google_words = {}
for line in g_file:
temp = line.split()
google_words[temp[0]] = [float(temp[i]) for i in range(1, len(temp))]
Here is a simple way to generate a test line like the ones in such a file:
s = "x"
for i in range (10000):
s += " 1.2345"
print (s)
Parsing that line with the original version takes noticeable time; the version with only a single split call is nearly instantaneous.
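If you want to reproduce the comparison yourself, here is a minimal benchmark sketch (the helper function names and the timeit usage are my own additions; the test string matches the generated line above):

import timeit

# One synthetic line: a word followed by 10,000 floats.
s = "x" + " 1.2345" * 10000

def parse_repeated_split(line):
    # Original approach: line.split() is re-evaluated on every access.
    return {line.split()[0]: [float(line.split()[i])
                              for i in range(1, len(line.split()))]}

def parse_single_split(line):
    # Improved approach: split once, then index the result.
    temp = line.split()
    return {temp[0]: [float(x) for x in temp[1:]]}

print(timeit.timeit(lambda: parse_repeated_split(s), number=1))
print(timeit.timeit(lambda: parse_single_split(s), number=1))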
Answer 2 (score: 1)
You could also use the csv module, which should be more efficient than what you are doing now. It would look something like this:
import csv

d = {}
with open("huge_file_so_huge.txt", "r") as g_file:
    for row in csv.reader(g_file, delimiter=" "):
        d[row[0]] = list(map(float, row[1:]))
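One caveat worth noting with this approach: csv.reader with delimiter=" " treats every single space as a separator, whereas str.split() with no arguments collapses runs of whitespace. A quick sketch of the difference (the sample line is made up):

import csv
import io

line = "word  1.0 2.0"  # note the double space after "word"

print(next(csv.reader(io.StringIO(line), delimiter=" ")))
# ['word', '', '1.0', '2.0']  -- the run of spaces produces an empty field

print(line.split())
# ['word', '1.0', '2.0']  -- whitespace runs are collapsed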