Question

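I have a DataFrame of news data with one column listing the headlines of all articles written during the year, another column with each article's month, and a third column classifying each article as positive, negative, balanced, or informative.

The DataFrame looks like this (only fictitious examples from January and March are shown here):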

Headline                    month          tendency
'The US Economy xxxxxx'     January        positive
'The UN warns xxxxxxxx'     January        balanced
'Tesla xxxxxxxx'            March          positive

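The data covers all months, and I want to create a column named count that holds the number of articles published in a given month with a given tendency (positive, negative, balanced, or informative). For example, say January has 40 articles in total, of which 20 are positive, 5 are balanced, 5 are informative, and 10 are negative. March has 30 articles in total: 15 positive, 5 negative, 5 balanced, and 5 informative. In the column I want to create, each row's value would be the number described above that matches the article's month and tendency. So the final DataFrame would look like this: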

Headline                    month          tendency     count
'The US Economy xxxxxx'     January        positive     20
'The UN warns xxxxxxxx'     January        balanced     5
'Tesla xxxxxxxx'            March          positive     15

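It doesn't matter that the count values repeat; I only need them for reference.

I can print the result and the logic works fine, but I can't find a way to create the column and assign the value for each month.

My code is as follows: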

data[(data['month'] == 'January') & (data['tendency'] == 'positive')].count()

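You can change the month and tendency and it gives you the desired result. Should I write an if statement for each tendency? What is the best way to create the count column?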

Answer 1

You can do this by combining an aggregation (groupby) with a join.

For example, something like this:

import pandas as pd

# This is input, named 'df', I added a fourth headline to test the aggregation.
df = pd.DataFrame({'Headline' : ['The US Economy xxxxxx','The UN warns xxxxxxxx','Tesla xxxxxxxx','Tesla yyyyyyy'],
                      'month' : ['January','January','March','March'],
                   'tendency' : ['positive', 'balanced', 'positive', 'positive']})

# Make a series that counts articles by month and tendency
countByMonthTendency = df.groupby(['month','tendency']).size().rename('count')

# Join it back to your data on the same two columns.
# join returns a new DataFrame, so assign the result to keep the count column.
df = df.join(countByMonthTendency, on=['month','tendency'])
print(df)

This produces:

    Headline                month    tendency   count
0   The US Economy xxxxxx   January  positive   1
1   The UN warns xxxxxxxx   January  balanced   1
2   Tesla xxxxxxxx          March    positive   2
3   Tesla yyyyyyy           March    positive   2
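
As an alternative to building a separate series and joining it back, the same per-row count can be computed in one step with a groupby transform. This is only a sketch on the same toy data as above, not part of the original answer. It assumes the Headline column is present in every row and is used here merely as the column to count over.

```python
import pandas as pd

# Same toy input as in the answer above.
df = pd.DataFrame({
    'Headline': ['The US Economy xxxxxx', 'The UN warns xxxxxxxx',
                 'Tesla xxxxxxxx', 'Tesla yyyyyyy'],
    'month': ['January', 'January', 'March', 'March'],
    'tendency': ['positive', 'balanced', 'positive', 'positive'],
})

# transform('size') returns, for every row, the size of the
# (month, tendency) group that row belongs to, aligned to df's index,
# so it can be assigned directly as a new column.
df['count'] = df.groupby(['month', 'tendency'])['Headline'].transform('size')

print(df)
```

Because the transformed result keeps the original index, no join is needed, and no if statement per tendency is required.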
