Question

I have a pandas DataFrame that looks like this:

cleanText.head()
    source      word    count
0   twain_ess            988
1   twain_ess   works    139
2   twain_ess   short    139
3   twain_ess   complete 139
4   twain_ess   would    98
5   twain_ess   push     94

I also have a dictionary containing the total word count for each source:

titles
{'orw_ess': 1729, 'orw_novel': 15534, 'twain_ess': 7680, 'twain_novel': 60004}

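My goal is to normalize the word counts for each source by the total number of words in that source, i.e. turn them into percentages. This seems like it should be trivial, but Python seems to make it very difficult (if anyone could explain the rules for in-place operations to me, that would be great).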

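The complication is that I need to filter the entries in cleanText down to those from a single source, and then divide the counts for that subset in place by the value in the dictionary.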

# Adjust total word counts and normalize
for key, value in titles.items():

    # This corrects the total words for overcounting the '' entries
    overcounted= cleanText[cleanText.iloc[:,0]== key].iloc[0,2]
    titles[key]= titles[key]-overcounted

    # This is where I divide by total words, however it does not save inplace, or at all for that matter
    cleanText[cleanText.iloc[:,0]== key].iloc[:,2]= cleanText[cleanText.iloc[:,0]== key]['count']/titles[key]

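If anyone could explain how to change this division statement so that the output is actually saved back into the original column, that would be great.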

Thanks

Answer 1

If I understand correctly:

cleanText['count']/cleanText['source'].map(titles)

Which gives you:

0    0.128646
1    0.018099
2    0.018099
3    0.018099
4    0.012760
5    0.012240
dtype: float64
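For anyone who wants to reproduce this, here is a minimal self-contained sketch. The frame is rebuilt by hand from the six rows shown in the question (the real cleanText presumably has rows for the other sources too), and the snake_case names are just illustrative.

```python
import pandas as pd

# Sample rows copied from the question; the first row is the '' (empty word) entry.
clean_text = pd.DataFrame({
    "source": ["twain_ess"] * 6,
    "word": ["", "works", "short", "complete", "would", "push"],
    "count": [988, 139, 139, 139, 98, 94],
})
titles = {"orw_ess": 1729, "orw_novel": 15534,
          "twain_ess": 7680, "twain_novel": 60004}

# map() looks up each row's source in the dictionary, giving a Series of
# per-row totals aligned with the frame's index.
totals = clean_text["source"].map(titles)

# Element-wise division; no loop and no filtering by source is needed.
normalized = clean_text["count"] / totals
print(normalized)
```

Because map() returns a Series with the same index as cleanText, the division lines up row by row regardless of how many sources the frame contains.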

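To assign these normalized values (fractions of the total; multiply by 100 if you want percentages) back to your count column, use: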

cleanText['count'] = cleanText['count']/cleanText['source'].map(titles)
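On the in-place part of the question: the original loop does not save anything because cleanText[mask] builds a new object, and the following .iloc[...] = ... assigns into that temporary copy rather than into cleanText (this is pandas' chained-indexing pitfall). If you do want to update one subset at a time, a single .loc[mask, column] assignment on the original frame works. Below is a rough sketch under that assumption. It also applies the '' overcount correction from the question; the names adjusted_total and is_empty are made up for illustration, and the sample rows are copied from the question.

```python
import pandas as pd

clean_text = pd.DataFrame({
    "source": ["twain_ess"] * 6,
    "word": ["", "works", "short", "complete", "would", "push"],
    "count": [988, 139, 139, 139, 98, 94],
})
titles = {"orw_ess": 1729, "orw_novel": 15534,
          "twain_ess": 7680, "twain_novel": 60004}

# Convert to float up front so the fractional results fit the column's dtype.
clean_text["count"] = clean_text["count"].astype(float)

for key, total in titles.items():
    mask = clean_text["source"] == key
    if not mask.any():
        continue  # no rows for this source in the sample data

    # Subtract the count of the '' entries from the total, as in the question.
    is_empty = mask & (clean_text["word"] == "")
    adjusted_total = total - clean_text.loc[is_empty, "count"].sum()

    # One .loc call with (row mask, column label) writes into clean_text itself,
    # unlike clean_text[mask].iloc[:, 2] = ..., which writes into a copy.
    clean_text.loc[mask, "count"] = clean_text.loc[mask, "count"] / adjusted_total

print(clean_text)
```

The map() version above is still the simpler choice when every row gets the same treatment; the .loc form is mainly useful when only some rows should change.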
