Question

Novice programmer here, looking for help. I have a DataFrame that looks like this:

  Cashtag      Date           Message  
0  $AAPL    2018-01-01   "Blah blah $AAPL"
1  $AAPL    2018-01-05   "Blah blah $AAPL"      
2  $AAPL    2019-01-08   "Blah blah $AAPL"     
3  $AAPL    2019-02-09   "Blah blah $AAPL"
4  $AAPL    2019-02-10   "Blah blah $AAPL"
5  $AAPL    2019-03-01   "Blah blah $AAPL"
6  $FB      2018-01-03   "Blah blah $FB"
7  $FB      2018-02-10   "Blah blah $FB"    
8  $FB      2018-02-11   "Blah blah $FB"   
9  $FB      2019-03-22   "Blah blah $FB" 
10 $AMZN    2018-04-13   "Blah blah $AMZN"
11 $AMZN    2018-04-29   "Blah blah $AMZN"
12 $AMZN    2019-07-23   "Blah blah $AMZN"     
13 $AMZN    2019-07-27   "Blah blah $AMZN"

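My desired output is a DataFrame that tells me the number of messages per month, per year, for each company in the sample. For this example, that would be: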

   Cashtag    Date    #Messages       
0  $AAPL    2018-01      02       
1  $AAPL    2019-01      01   
2  $AAPL    2019-02      02     
3  $AAPL    2019-03      01
4  $FB      2018-01      01
5  $FB      2018-02      02        
6  $FB      2019-03      01   
7  $AMZN    2018-04      02  
8  $AMZN    2019-07      02

I have tried many combinations of .groupby() but have not arrived at a solution.

How can I get the desired output?

Answer 1

Try:

If Date is a string:

>>> df.groupby([df["Cashtag"], df["Date"].apply(lambda x: x[:7])]).agg({"Message": "count"}).reset_index()

If Date is a datetime:

>>> df.groupby([df["Cashtag"], df["Date"].apply(lambda x: "{0}-{1:02}".format(x.year, x.month))]).agg({"Message": "count"}).reset_index()

Either one outputs:

  Cashtag     Date  Message
0   $AAPL  2018-01        2
1   $AAPL  2019-01        1
2   $AAPL  2019-02        2
3   $AAPL  2019-03        1
4   $AMZN  2018-04        2
5   $AMZN  2019-07        2
6     $FB  2018-01        1
7     $FB  2018-02        2
8     $FB  2019-03        1
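For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's sample data and applies the string-slicing variant. It assumes Date is stored as strings in YYYY-MM-DD form and uses the vectorized `.str[:7]` accessor in place of `apply`; the variable names `rows`, `month` and `counts` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question (Date stored as strings).
rows = [
    ("$AAPL", "2018-01-01"), ("$AAPL", "2018-01-05"), ("$AAPL", "2019-01-08"),
    ("$AAPL", "2019-02-09"), ("$AAPL", "2019-02-10"), ("$AAPL", "2019-03-01"),
    ("$FB", "2018-01-03"), ("$FB", "2018-02-10"), ("$FB", "2018-02-11"),
    ("$FB", "2019-03-22"),
    ("$AMZN", "2018-04-13"), ("$AMZN", "2018-04-29"),
    ("$AMZN", "2019-07-23"), ("$AMZN", "2019-07-27"),
]
df = pd.DataFrame(rows, columns=["Cashtag", "Date"])
df["Message"] = "Blah blah " + df["Cashtag"]

# Slice "YYYY-MM" out of each date string (assumes ISO-formatted strings).
month = df["Date"].str[:7]

# Count non-null messages for every (Cashtag, month) pair.
counts = (
    df.groupby(["Cashtag", month])["Message"]
    .count()
    .reset_index()
)
print(counts)
```

The slice only works because the strings start with the year and month; for any other date format, converting with pd.to_datetime first is the safer route.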

Answer 2

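There are two tricky parts here. One is handling the dates, the other is the groupby itself.

To group by year and month only, you need to extract them from the date. You can either slice the strings or convert the 'Date' column to datetimes and format them with strftime. I'll use the second approach because I find it more readable, and it is also a useful thing to learn.

The key point about groupby is that you can pass it a list of column labels. The aggregation is then computed for each unique combination of values in those columns.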

# convert Date to datetimes
df['Date'] = pd.to_datetime(df['Date'])
# extract year and month from datetime objects with `strftime`
df['year-month'] = df['Date'].apply(lambda x: (x.strftime('%Y-%m')))
# groupby columns 'Cashtag' and 'year-month' and aggregate 'Message' using the `count` function
df.groupby(['Cashtag', 'year-month'])['Message'].count()

If you don't want to create a new column, you can do it in one line:

df.groupby(['Cashtag', df['Date'].apply(lambda x: (x.strftime('%Y-%m')))])['Message'].count()
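As a rough, self-contained illustration of the "unique combination" idea, the sketch below uses a trimmed-down version of the question's data. It swaps in `.dt.strftime` (the vectorized form of the lambda above) and `size()` with `sort=False`, so the groups come out in order of first appearance, as in the question's desired output. The column name `n_messages` is just a placeholder.

```python
import pandas as pd

df = pd.DataFrame({
    "Cashtag": ["$AAPL", "$AAPL", "$AAPL", "$FB", "$FB"],
    "Date": ["2018-01-01", "2018-01-05", "2019-01-08", "2018-01-03", "2018-02-10"],
    "Message": ["Blah blah $AAPL"] * 3 + ["Blah blah $FB"] * 2,
})
df["Date"] = pd.to_datetime(df["Date"])

# Vectorized equivalent of apply(lambda x: x.strftime('%Y-%m')).
df["year-month"] = df["Date"].dt.strftime("%Y-%m")

# One output row per unique (Cashtag, year-month) pair.
# sort=False keeps groups in order of first appearance instead of sorting the keys.
result = (
    df.groupby(["Cashtag", "year-month"], sort=False)
    .size()
    .reset_index(name="n_messages")  # placeholder column name
)
print(result)
```

Note that `size()` counts rows, while `count()` on 'Message' counts only non-null messages; the two agree here because no message is missing.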

Answer 3

A solution using resample:

import pandas as pd


data = [
    ('$AAPL', '2018-01-01', "Blah blah $AAPL"),
    ('$AAPL', '2018-01-05', "Blah blah $AAPL"),      
    ('$AAPL', '2019-01-08', "Blah blah $AAPL"),     
    ('$AAPL', '2019-02-09', "Blah blah $AAPL"),
    ('$AAPL', '2019-02-10', "Blah blah $AAPL"),
    ('$AAPL', '2019-03-01', "Blah blah $AAPL"),
    ('$FB',   '2018-01-03', "Blah blah $FB"),
    ('$FB',   '2018-02-10', "Blah blah $FB"),  
]

df = pd.DataFrame.from_records(data=data, columns=['Cashtag', 'Date', 'Message'])


df['Date'] = pd.to_datetime(df['Date'])

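df = (df
    .set_index(pd.DatetimeIndex(df['Date']))
    .groupby('Cashtag')
    # month-end bins; pandas 2.2+ deprecates 'M' here in favor of 'ME'
    .resample('M')['Message']
    .count()
    .reset_index()
    # resample also emits the empty months between observations; drop them
    .query('Message > 0')
    .reset_index(drop=True)
)
# show the month-end timestamps as monthly periods such as 2018-01
df['Date'] = df['Date'].dt.to_period('M')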

Output:

  Cashtag     Date  Message
0   $AAPL  2018-01        2
1   $AAPL  2019-01        1
2   $AAPL  2019-02        2
3   $AAPL  2019-03        1
4     $FB  2018-01        1
5     $FB  2018-02        1

Or a simpler solution:

df['Date'] = pd.to_datetime(df['Date']).dt.to_period('M')
df = df.groupby(['Cashtag', 'Date'])['Message'].count().reset_index()
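A small runnable sketch of the period-based variant, under the assumption that you would rather leave the original Date column untouched. Grouping on monthly periods only produces months that actually contain rows, so no filtering of empty months is needed. The names `month`, `out` and the 'Month' label are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Cashtag": ["$AMZN", "$AMZN", "$AMZN", "$AMZN"],
    "Date": ["2018-04-13", "2018-04-29", "2019-07-23", "2019-07-27"],
    "Message": ["Blah blah $AMZN"] * 4,
})

# Keep the original column intact and group on a derived monthly Period series.
month = pd.to_datetime(df["Date"]).dt.to_period("M").rename("Month")
out = df.groupby(["Cashtag", month])["Message"].count().reset_index()

# Periods display as e.g. 2018-04; cast to str if plain text labels are needed.
out["Month"] = out["Month"].astype(str)
print(out)
```

Keeping the periods as they are (rather than casting to strings) can be handy if you later want to sort or compare months chronologically.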

Group a DataFrame by two columns and get the count of another column
