Question

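I have a list of strings to search and a list of words, and I want to generate boolean feature vectors indicating whether each word is present in each string. What is the fastest way to build these feature vectors in Python?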

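In the example below, the first string would produce [1,0,1,1]. I am currently using the Aho-Corasick algorithm (Python - find occurrences of list of strings within string) to search each string inside a list comprehension, but I suspect there may be a faster way. The code below uses this method and averages the run time over 10 runs.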

import time
import numpy as np
import ahocorasick

def check_strings(A, search_list, string_to_search):
    """Use the Aho-Corasick algorithm to produce a boolean list indicating
    the presence of strings within a longer string"""
    index_list = []
    for item in A.iter(string_to_search):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()


word_list = ["foo", "bar", "hello", "world"]
strings_to_check = ["hello world foo", "foo bar", "bar world"]

A = ahocorasick.Automaton()
for idx, s in enumerate(word_list):
    A.add_word(s, (idx, s))
A.make_automaton()

run_times = []
for i in range(10):
    t0 = time.time()
    feature_vectors = [check_strings(A, word_list, s) for s in strings_to_check]
    run_times.append(time.time()-t0)

print(feature_vectors)
print(np.mean(np.array(run_times)))

The output is:

[[1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
1.65939331055e-05
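
For comparison, here is a minimal baseline sketch that does not use the automaton at all. It assumes the same substring semantics as the code above (a word counts as present if it occurs anywhere in the string), and the helper name `substring_features` is hypothetical. It uses `timeit.repeat` rather than a single `time.time()` difference, since the measured interval is only a few microseconds. Whether this is faster than the Aho-Corasick version will depend on the number of words and the length of the strings, so it is only a reference point for benchmarking.

```python
import timeit

def substring_features(words, text):
    """Return a 0/1 list with 1 where the word occurs as a substring of text.

    Hypothetical baseline helper: one `in` check per word, so the cost grows
    with the number of words, unlike a single pass over an automaton.
    """
    return [1 if w in text else 0 for w in words]

word_list = ["foo", "bar", "hello", "world"]
strings_to_check = ["hello world foo", "foo bar", "bar world"]

feature_vectors = [substring_features(word_list, s) for s in strings_to_check]
print(feature_vectors)

# Time 1000 calls per repeat and keep the best repeat to reduce timer noise.
n_calls = 1000
best = min(timeit.repeat(
    lambda: [substring_features(word_list, s) for s in strings_to_check],
    repeat=5, number=n_calls))
print(best / n_calls)  # approximate seconds per call
```

On the three example strings this prints the same feature vectors as the Aho-Corasick version; the timing line is machine-dependent.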

Fastest way to build string feature vectors in Python

0 answers: