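I have two pandas DataFrames. The first contains the unigrams extracted from a text, along with each unigram's count and probability in that text. It is structured like this: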
unigram_df
           word  count      prob
0            we    109  0.003615
1  investigated     20  0.000663
2           the   1125  0.037315
3     potential     36  0.001194
4            of   1122  0.037215
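The second contains the skip-grams extracted from the same text, along with each skip-gram's count and probability in that text. It looks like this: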
skipgram_df
                        word  count      prob
0         (we, investigated)      5  0.000055
1                  (we, the)     31  0.000343
2            (we, potential)      2  0.000022
3        (investigated, the)     11  0.000122
4  (investigated, potential)      3  0.000033
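Now I want to compute the pointwise mutual information (PMI) of each skip-gram, which is basically the logarithm of the skip-gram's probability divided by the product of its unigrams' probabilities. I wrote a function for this that iterates over the skip-gram DataFrame, and it works exactly the way I want, but I have a serious performance problem. Is there a way to improve my code so that it computes the PMI faster?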
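Written out, the quantity described above is PMI(x, y) = log10(P(x, y) / (P(x) * P(y))). As a minimal sketch, here is that formula applied to a single pair using the probabilities from the sample tables. The function name `pmi` is only illustrative, and base 10 is assumed because the code in this question uses `math.log10`.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information (base 10) of one pair."""
    return math.log10(p_xy / (p_x * p_y))

# (we, investigated): skip-gram prob 0.000055, unigram probs 0.003615 and 0.000663
value = pmi(0.000055, 0.003615, 0.000663)
print(round(value, 2))  # 1.36
```

A positive value means the pair co-occurs more often than it would if the two words were independent.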
Here is my code:
import math

def calculate_pmi(row):
    # row comes from itertuples(): (Index, word, count, prob)
    skipgram_prob = float(row[3])
    # each lookup below scans the whole unigram frame with a boolean mask
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][0]]
                           ['prob'])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][1]]
                           ['prob'])
    pmi = math.log10(float(skipgram_prob / (x_unigram_prob * y_unigram_prob)))
    result = str(str(row[1][0]) + ' ' + str(row[1][1]) + ' ' + str(pmi))
    return result

pmi_list = list(map(calculate_pmi, skipgram_df.itertuples()))
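Right now the function runs at about 483.18 it/s, which is very slow because I have hundreds of thousands of skip-grams to iterate over. Any suggestions are welcome. Thanks.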
Answer 0 (score: 1)
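This is a good question and a good exercise for new pandas users. Use row-by-row iteration such as df.iterrows (or df.itertuples, as in the question) only as a last resort, and even then consider the alternatives. Cases where it is the right choice are relatively rare.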
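One likely reason the loop in the question is slow is that every call scans the whole unigram frame twice with a boolean mask. If a Python-level loop has to stay, a cheaper variant looks each word up in a plain dict that is built once. The following is only a sketch on a few sample rows from the question; `word_to_prob` and `pmi_lines` are names made up for this illustration.

```python
import math
import pandas as pd

unigram_df = pd.DataFrame({
    'word': ['we', 'investigated', 'the', 'potential', 'of'],
    'count': [109, 20, 1125, 36, 1122],
    'prob': [0.003615, 0.000663, 0.037315, 0.001194, 0.037215],
})
skipgram_df = pd.DataFrame({
    'word': [('we', 'investigated'), ('we', 'the'), ('we', 'potential')],
    'count': [5, 31, 2],
    'prob': [0.000055, 0.000343, 0.000022],
})

# Build the lookup table once; each later access is a constant-time dict lookup.
word_to_prob = dict(zip(unigram_df['word'], unigram_df['prob']))

pmi_lines = []
for (x, y), p_xy in zip(skipgram_df['word'], skipgram_df['prob']):
    pmi = math.log10(p_xy / (word_to_prob[x] * word_to_prob[y]))
    pmi_lines.append(f'{x} {y} {pmi}')
```

This still loops in Python, so on large frames a column-wise computation is usually the better choice.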
Below is an example of how to vectorize the calculation.
import numpy as np
import pandas as pd

uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663],
                    ['the', 1125, 0.037315], ['potential', 36, 0.001194],
                    ['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])

skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
                     [('we', 'the'), 31, 0.000343],
                     [('we', 'potential'), 2, 0.000022],
                     [('investigated', 'the'), 11, 0.000122],
                     [('investigated', 'potential'), 3, 0.000033]],
                    columns=['word', 'count', 'prob'])

# first split the column of tuples in skip into two word columns
# (building from a list avoids the slow per-row apply(pd.Series))
skip[['word1', 'word2']] = pd.DataFrame(skip['word'].tolist(), index=skip.index)

# set index of uni to 'word' so it can serve as a lookup table
uni = uni.set_index('word')

# pull prob1 & prob2 from uni into skip
skip['prob1'] = skip['word1'].map(uni['prob'])
skip['prob2'] = skip['word2'].map(uni['prob'])

# perform the calculation (base 10, matching math.log10 in the question)
# and keep only the columns of interest
skip['result'] = np.log10(skip['prob'] / (skip['prob1'] * skip['prob2']))
skip = skip[['word', 'count', 'prob', 'result']]
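The same steps can be wrapped in a function that leaves the input frames untouched. This is a sketch rather than a tuned implementation: `add_pmi` is a hypothetical helper name, the output column name `pmi` is arbitrary, and the code assumes the `word` and `prob` column layout shown in the question with unique words in the unigram table. Pairs containing a word that is missing from the unigram table end up as NaN instead of raising.

```python
import numpy as np
import pandas as pd

def add_pmi(skip, uni):
    """Return a copy of `skip` with a base-10 PMI column named 'pmi'."""
    prob = uni.set_index('word')['prob']          # word -> probability
    words = pd.DataFrame(skip['word'].tolist(),   # split tuples into two columns
                         index=skip.index, columns=['w1', 'w2'])
    p1 = words['w1'].map(prob)                    # NaN where a word is unknown
    p2 = words['w2'].map(prob)
    out = skip.copy()
    out['pmi'] = np.log10(out['prob'] / (p1 * p2))
    return out

uni = pd.DataFrame({'word': ['we', 'investigated'],
                    'prob': [0.003615, 0.000663]})
skip = pd.DataFrame({'word': [('we', 'investigated'), ('we', 'unseen')],
                     'prob': [0.000055, 0.000010]})
result = add_pmi(skip, uni)
```

Rows with NaN can then be dropped with `result.dropna(subset=['pmi'])` if unknown words should simply be ignored.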