Question

I have a text file that looks like this (output of cat path_to_file in IPython):

0   0.25    truth fact 
1   0.25    train home find travel
........
199 0.25    video box store office

I also have another list:

vec = [(76, 0.04334748761500331),
 (128, 0.03697806086341099),
 (81, 0.03131634819532892),
 (1, 0.03131634819532892)]

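Now I want to keep only the rows of the text file whose first column matches the first column of vec, and output the second column of vec together with the third column of the text file.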

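If my text file had the same format as vec, I could use set(a) & set(b). But the values in the text file are tab-separated; this is what they look like when I do the following:

with open(path_to_file) as f:
    lines = f.read().splitlines()

The output is: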

['0\t0.25\ttruth fact lie
.........................
 '198\t0.25\tfan genre bit enjoy ',
 '199\t0.25\tvideo box store office  ']

Answer 1

Using NumPy:

import numpy as np
import numpy.lib.recfunctions as rfn

# Load column 0 (index) and column 2 (text) of the tab-separated file.
dtype = [('index', int), ('text', object)]
table = np.loadtxt(path_to_file, dtype=dtype, usecols=(0,2), delimiter='\t')

# Turn vec into a structured array with the same 'index' key.
dtype = [('index', int), ('score', float)]
array = np.array(vec, dtype=dtype)

# Inner join of the two arrays on 'index'.
joined = rfn.join_by('index', table, array)

for row in joined:
    print(row['index'], row['score'], row['text'])

If you care a lot about performance, you could use np.savetxt() for the output, but I think this way is easier to understand.

Answer 2

Converting vec to a dict and splitting each line on "\t" should work:

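vecdict = dict(vec)

output = []
with open(path_to_file) as f:
    for l in f:
        # Drop the trailing newline, then split into tab-separated columns.
        words = l.rstrip('\n').split('\t')
        key = int(words[0])  # the keys in vec are ints
        if key in vecdict:
            output.append("%s %f %s" % (words[0], vecdict[key], ' '.join(words[2:])))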

output should then be a list of strings.
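For reference, here is a minimal, self-contained Python 3 sketch of the same dict-lookup idea. It assumes the file has exactly three tab-separated columns; the `sample` string and the `match_rows` helper are hypothetical names made up for illustration, and the `csv` module is swapped in for the manual `split('\t')`.

```python
import csv
import io

# Hypothetical sample standing in for the tab-separated file in the question.
sample = (
    "0\t0.25\ttruth fact \n"
    "1\t0.25\ttrain home find travel\n"
    "199\t0.25\tvideo box store office\n"
)

vec = [(76, 0.04334748761500331),
       (128, 0.03697806086341099),
       (81, 0.03131634819532892),
       (1, 0.03131634819532892)]


def match_rows(lines, vec):
    """Return (index, score, text) for rows whose index appears in vec."""
    scores = dict(vec)  # index -> score, for constant-time lookups
    matched = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        index = int(row[0])
        if index in scores:
            matched.append((index, scores[index], row[2].strip()))
    return matched


rows = match_rows(io.StringIO(sample), vec)
for index, score, text in rows:
    print(index, score, text)
```

To run it against a real file, something like `with open(path_to_file, newline="") as f: rows = match_rows(f, vec)` should work in place of the `io.StringIO` wrapper.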

PS: If you have multiple delimiters, or are not sure which delimiter is used, you can call split repeatedly or use re, for example:

import re

print(re.findall(r"[\w]+", "this has    multiple delimiters\tHere"))

>> ["this", "has", "multiple", "delimiters", "Here"]
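Applied to a line shaped like those in the question, one possible variant is `re.split` with `maxsplit`, so that the words of the third column stay together. This is only a sketch: the sample `line`, with mixed tabs and spaces between columns, is made up for illustration.

```python
import re

# Hypothetical line whose columns are separated by a mix of tabs and spaces.
line = "1\t0.25   train home find travel"

# Split on any run of whitespace, but only twice, so the text column stays whole.
fields = re.split(r"\s+", line.strip(), maxsplit=2)
index, score, text = int(fields[0]), float(fields[1]), fields[2]
print(index, score, text)
```

Capping the number of splits avoids having to rejoin the text column afterwards.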

Get matching columns from a text file against a Python list
