Question

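I have a dataset of large strings (text extracted from ~300 pptx files). Using pandas apply, I run an "average" function on each string. For every word, the function looks up the corresponding word vector, multiplies it with another vector, and returns the average correlation.

However, iterating over the large strings and applying the function takes a lot of time, and I would like to know what approaches I can take to speed up the following code: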

from scipy import spatial

# retrieve the word vector for w from the words df
def vec(w):
    return words.at[w]

# cosine similarity between two vectors (1 - cosine distance)
def cosine_dist(a, b):
    codi = 1 - spatial.distance.cosine(a, b)
    return codi

# average cosine similarity between every known word in the string and a given word vector
v_search = vec("test")
def Average(v_search, tobe_parsed):
    word_total = 0
    mean = 0
    for word in tobe_parsed.split():
        try:  # word exists
            cd = cosine_dist(vec(word), v_search)
            mean += cd
            word_total += 1
        except KeyError:  # word does not exist
            pass

    # avoid dividing by zero when no word of the string is in the vocabulary
    if word_total == 0:
        return float("nan")
    return mean / word_total

df['average'] = df['text'].apply(lambda x: Average(v_search, x))

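I have been looking into alternative ways of writing the code (e.g. df.loc -> df.at), Cython, and multithreading, but my time is limited, so I don't want to waste too much of it on a less efficient approach.

Thanks in advance.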

Answer 1

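You need to take advantage of vectorization and NumPy broadcasting. Have pandas return a list of word indices, use them to index the vocabulary array and build a matrix of word vectors (one row per word), then use broadcasting to compute the cosine distances and take their mean.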
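The steps described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual setup: the `vocab` mapping, the toy `vectors` array, and the function name `mean_cosine_similarity` are all assumptions made for the example.

```python
import numpy as np

def mean_cosine_similarity(word_matrix, query):
    """Mean cosine similarity between each row of word_matrix and query."""
    word_matrix = np.asarray(word_matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Dot product of every row with the query in one matrix-vector product.
    dots = word_matrix @ query
    # Per-row norms multiplied by the scalar query norm (scalar broadcasting).
    norms = np.linalg.norm(word_matrix, axis=1) * np.linalg.norm(query)
    return float(np.mean(dots / norms))

# Toy vocabulary (hypothetical data): 3 words, 2-dimensional vectors.
vocab = {"cat": 0, "dog": 1, "test": 2}
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

text = "cat dog cat unknown"
# Map words to row indices, skipping words missing from the vocabulary.
idx = [vocab[w] for w in text.split() if w in vocab]
# Fancy indexing builds the word-vector matrix, one row per known word.
result = mean_cosine_similarity(vectors[idx], vectors[vocab["test"]])
```

The Python-level loop only builds the index list. All arithmetic runs inside NumPy on the whole matrix at once, which is where the speed-up comes from.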

Answer 2

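Thanks a lot, vumaasha! This is indeed the way to go (the run time dropped from ~15 minutes to ~7 seconds! :o)

Basically, the code has been rewritten as: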

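import numpy as np

def Average(v_search, text):
    # cos_cdist is a helper whose definition is not shown in this post
    wordvec_matrix = words.loc[text.split()]
    return np.sum(cos_cdist(wordvec_matrix, v_search)) / wordvec_matrix.shape[0]

df['average'] = df['text'].apply(lambda x: Average(v_search, x))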
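The snippet above relies on `cos_cdist`, which the post does not define. In recent pandas versions, `words.loc[...]` with a list also raises a `KeyError` as soon as one word is missing from the index. Here is one possible self-contained variant, assuming `words` is a DataFrame with one row per word and a unique index. The toy data and the name `average_similarity` are hypothetical, and `scipy.spatial.distance.cdist` stands in for the undefined helper.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical vocabulary: one row per word, one column per dimension.
words = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    index=["cat", "dog", "test"],
)

def average_similarity(v_search, text):
    # reindex tolerates unknown words (they become all-NaN rows), which are dropped.
    matrix = words.reindex(text.split()).dropna(how="all")
    if matrix.empty:
        return np.nan  # no word of the string is in the vocabulary
    # cdist returns cosine distances; 1 - distance gives the similarity.
    query = v_search.to_numpy()[None, :]
    sims = 1.0 - cdist(matrix.to_numpy(), query, metric="cosine")
    return float(sims.mean())

v_search = words.loc["test"]
df = pd.DataFrame({"text": ["cat dog cat", "dog unknown", "nothing known"]})
df["average"] = df["text"].apply(lambda t: average_similarity(v_search, t))
```

Repeated words stay in the matrix, so each occurrence counts toward the mean, as in the loop from the question.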

Speeding up Python NLP text parsing