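I have two dataframes: topic_, which is the target dataframe, and tw, which is the source dataframe. topic_ is a topic-by-word matrix in which each cell stores the probability of a word occurring in a particular topic. I initialized the topic_ dataframe to zeros using numpy.zeros. A sample of the tw dataframe: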
print(tw)
    topic_id                                      word_prob_pair
 0         0  [(customer, 0.061703717964), (team, 0.01724444...
 1         1  [(team, 0.0260560163563), (customer, 0.0247838...
 2         2  [(customer, 0.0171786268847), (footfall, 0.012...
 3         3  [(team, 0.0290787264225), (product, 0.01570401...
 4         4  [(team, 0.0197917953222), (data, 0.01343226630...
 5         5  [(customer, 0.0263740639141), (team, 0.0251677...
 6         6  [(customer, 0.0289764173735), (team, 0.0249938...
 7         7  [(client, 0.0265082412402), (want, 0.016477447...
 8         8  [(customer, 0.0524006965405), (team, 0.0322975...
 9         9  [(generic, 0.0373422774996), (product, 0.01834...
10        10  [(customer, 0.0305256248248), (team, 0.0241559...
11        11  [(customer, 0.0198707090364), (ad, 0.018516805...
12        12  [(team, 0.0159852971954), (customer, 0.0124540...
13        13  [(team, 0.033444510469), (store, 0.01961003290...
14        14  [(team, 0.0344793243818), (customer, 0.0210975...
15        15  [(team, 0.026416114692), (customer, 0.02041691...
16        16  [(campaign, 0.0486186973667), (team, 0.0236024...
17        17  [(customer, 0.0208270072145), (branch, 0.01757...
18        18  [(team, 0.0280889397541), (customer, 0.0127932...
19        19  [(team, 0.0297011415217), (customer, 0.0216007...
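My topic_ dataframe has shape num_topics (i.e. 20) x number_of_unique_words (the unique words in the tw dataframe).
Below is the code I use to fill in the values of the topic_ dataframe: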
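for each_topic in range(num_topics):
    a = tw['word_prob_pair'].iloc[each_topic]
    for word, prob in a:
        # originally topic_.set_value(each_topic, word, prob);
        # set_value was removed in pandas 1.0, .at is the scalar setter
        topic_.at[each_topic, word] = prob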
Is there a better way to accomplish this task?
Answer 0 (score: 5)
You can use a list comprehension with the DataFrame constructor, and finally replace NaN with 0 using fillna:
import pandas as pd

df = pd.DataFrame({'word_prob_pair':[
[('customer', 0.061703717964), ('team', 0.01724444)],
[('team', 0.0260560163563), ('customer', 0.0247838)],
[('customer', 0.0171786268847), ('footfall', 0.012)],
[('team', 0.0290787264225), ('product', 0.01570401)],
[('team', 0.0197917953222), ('data', 0.01343226630)],
[('customer', 0.0263740639141), ('team', 0.0251677)],
[('customer', 0.0289764173735), ('team', 0.0249938)],
[('client', 0.0265082412402), ('want', 0.016477447)]
] })
print (df)
word_prob_pair
0 [(customer, 0.061703717964), (team, 0.01724444)]
1 [(team, 0.0260560163563), (customer, 0.0247838)]
2 [(customer, 0.0171786268847), (footfall, 0.012)]
3 [(team, 0.0290787264225), (product, 0.01570401)]
4 [(team, 0.0197917953222), (data, 0.0134322663)]
5 [(customer, 0.0263740639141), (team, 0.0251677)]
6 [(customer, 0.0289764173735), (team, 0.0249938)]
7 [(client, 0.0265082412402), (want, 0.016477447)]
df1 = pd.DataFrame([dict(x) for x in df.word_prob_pair])
df1 = df1.fillna(0)
print (df1)
     client  customer      data  footfall   product      team      want
0  0.000000  0.061704  0.000000     0.000  0.000000  0.017244  0.000000
1  0.000000  0.024784  0.000000     0.000  0.000000  0.026056  0.000000
2  0.000000  0.017179  0.000000     0.012  0.000000  0.000000  0.000000
3  0.000000  0.000000  0.000000     0.000  0.015704  0.029079  0.000000
4  0.000000  0.000000  0.013432     0.000  0.000000  0.019792  0.000000
5  0.000000  0.026374  0.000000     0.000  0.000000  0.025168  0.000000
6  0.000000  0.028976  0.000000     0.000  0.000000  0.024994  0.000000
7  0.026508  0.000000  0.000000     0.000  0.000000  0.000000  0.016477
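If the topic ids should end up as the row index, as in the question's topic_ frame, the same idea can be applied to a frame shaped like tw. The following is only a minimal sketch: the three-row tw below is a hypothetical stand-in built from the first rows of the sample data, and passing the ids through `index=` is an assumption about the desired layout rather than part of the answer above.

```python
import pandas as pd

# Hypothetical stand-in for the question's tw frame (first three topics only)
tw = pd.DataFrame({
    'topic_id': [0, 1, 2],
    'word_prob_pair': [
        [('customer', 0.061703717964), ('team', 0.01724444)],
        [('team', 0.0260560163563), ('customer', 0.0247838)],
        [('customer', 0.0171786268847), ('footfall', 0.012)],
    ],
})

# One dict per topic; the constructor turns the union of keys into columns
# and leaves NaN where a word is absent from a topic.
topic_ = pd.DataFrame(
    [dict(pairs) for pairs in tw['word_prob_pair']],
    index=tw['topic_id'].tolist(),  # label rows by topic id
).fillna(0)

print(topic_)
```

Column order may differ between pandas versions, so it is safer to select columns by name than by position.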
Answer 1 (score: 3)
Using numpy:
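import numpy as np
import pandas as pd

# df is the source frame with columns topic_id and word_prob_pair
tid1 = df.topic_id.values
lens = [len(i) for i in df.word_prob_pair.values]
# repeat each topic id once per (word, prob) pair in its row
tid2 = tid1.repeat(lens)
cat, prob = np.concatenate(df.word_prob_pair.values).T
# unique words become the columns; inv gives each pair's column position
ucat, inv = np.unique(cat, return_inverse=True)
data = np.zeros((len(tid1), len(ucat)), dtype=float)
# topic_id values are used as row positions, so they must be 0..n-1
data[tid2, inv] = prob
pd.DataFrame(data, tid1, ucat)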
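To make the indexing step easier to follow, here is a small self-contained sketch of the same technique on three rows taken from the sample data. It departs from the code above in two ways, both of which are assumptions of this sketch: it uses row positions (`np.arange`) instead of the topic_id values as row indices, and it builds the word and probability arrays separately so the probabilities stay floats instead of passing through a mixed string array.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample in the same shape as the question's data
df = pd.DataFrame({
    'topic_id': [0, 1, 2],
    'word_prob_pair': [
        [('customer', 0.061703717964), ('team', 0.01724444)],
        [('team', 0.0260560163563), ('customer', 0.0247838)],
        [('customer', 0.0171786268847), ('footfall', 0.012)],
    ],
})

pairs = df['word_prob_pair'].tolist()

# Row position of every (word, prob) pair, one entry per pair
rows = np.repeat(np.arange(len(df)), [len(p) for p in pairs])
words = np.array([w for p in pairs for w, _ in p])
probs = np.array([pr for p in pairs for _, pr in p], dtype=float)

# ucat: sorted unique words; inv: column position of each pair's word
ucat, inv = np.unique(words, return_inverse=True)

# Start from zeros and scatter every probability into (row, column)
data = np.zeros((len(df), len(ucat)), dtype=float)
data[rows, inv] = probs

result = pd.DataFrame(data, index=df['topic_id'].to_numpy(), columns=ucat)
print(result)
```

Because the matrix starts as zeros, words missing from a topic need no separate fillna step.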
Timing