Repeat the rows of a DataFrame based on a condition

Time: 2019-02-20 16:13:32

Tags: python pandas group-by

I have a pandas DataFrame that looks like this:

hotel_id         date         length_of_stay     clicks
A               2019-01-01           3               7
B               2019-01-06           2               11
C               2019-01-03           1               4

I would like the result to be:

hotel_id         date                            clicks
A               2019-01-01                          7
A               2019-01-02                          7
A               2019-01-03                          7
B               2019-01-06                          11
B               2019-01-07                          11
C               2019-01-03                          4

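So there is one row per night of the stay, each carrying the number of clicks for that booking.

I can't think of an elegant way to do this. Can anyone help?

2 answers:

Answer 0 (score: 3)

Use numpy.repeat():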

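# Repeat every row length_of_stay times (df.values is an object array here,
# so the repeated columns come back with object dtype).
m = pd.DataFrame(np.repeat(df.values, df.length_of_stay, axis=0), columns=df.columns)
# Within each hotel, replace the repeated date with consecutive days.
m['date'] = m.groupby('hotel_id')['date'].transform(
    lambda x: pd.date_range(start=x.iloc[0], periods=len(x)))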

Or:

newdf = pd.DataFrame(np.repeat(df.values, df.length_of_stay, axis=0), columns=df.columns)
# Build the dates row by row from the original frame: n consecutive days per stay.
newdf['date'] = [i for day, n in zip(df.date, df.length_of_stay)
                 for i in pd.date_range(start=day, periods=n)]
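To make the repetition step concrete, here is a minimal sketch of what numpy.repeat does when it is given one count per row and axis=0. The array `rows` and the list `counts` are made-up illustration values, not part of the question's data.

```python
import numpy as np

# Hypothetical illustration data: one row per hotel, one repeat count per row.
rows = np.array([["A", 7], ["B", 11], ["C", 4]], dtype=object)
counts = [3, 2, 1]

# With axis=0, row i is repeated counts[i] times, in order.
repeated = np.repeat(rows, counts, axis=0)

print(repeated.shape)           # (6, 2)
print(repeated[:, 0].tolist())  # ['A', 'A', 'A', 'B', 'B', 'C']
```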

Full example:

import io

import numpy as np
import pandas as pd

data = '''\
hotel_id         date         length_of_stay     clicks
A               2019-01-01           3               7
B               2019-01-06           2               11
C               2019-01-03           1               4'''

# pd.compat.StringIO was removed from newer pandas releases; io.StringIO is the standard-library equivalent.
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, parse_dates=['date'], sep=r'\s+')

# Repeat every row length_of_stay times, then spread the dates within each hotel.
m = pd.DataFrame(np.repeat(df.values, df.length_of_stay, axis=0), columns=df.columns)
m['date'] = m.groupby('hotel_id')['date'].transform(
    lambda x: pd.date_range(start=x.iloc[0], periods=len(x)))
print(m)

  hotel_id       date length_of_stay clicks
0        A 2019-01-01              3      7
1        A 2019-01-02              3      7
2        A 2019-01-03              3      7
3        B 2019-01-06              2     11
4        B 2019-01-07              2     11
5        C 2019-01-03              1      4
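The result above still carries length_of_stay, and because df.values is an object array the non-date columns come back with object dtype. Below is a minimal sketch of a variant that avoids both. It swaps in Index.repeat and groupby(...).cumcount() for np.repeat, so it is not the answerer's method, and it assumes the frame has a unique index (such as the default RangeIndex) so that each original row can be identified by its label.

```python
import io

import pandas as pd

data = """\
hotel_id date length_of_stay clicks
A 2019-01-01 3 7
B 2019-01-06 2 11
C 2019-01-03 1 4"""

df = pd.read_csv(io.StringIO(data), sep=r"\s+", parse_dates=["date"])

# Repeat each row label length_of_stay times; selecting with .loc keeps column dtypes.
out = df.loc[df.index.repeat(df["length_of_stay"])].copy()

# Number the copies of each original row 0, 1, 2, ... and add that many days.
night = out.groupby(level=0).cumcount()
out["date"] = out["date"] + pd.to_timedelta(night, unit="D")

# Drop the helper column and renumber the rows to match the requested layout.
out = out.drop(columns="length_of_stay").reset_index(drop=True)
print(out)
```

Grouping on the repeated index rather than on hotel_id means the same hotel can appear in several bookings without their dates being merged into one range.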

Answer 1 (score: 2)

Here is another solution, which uses the "ugly" df.iterrows():

newdf = pd.concat(pd.DataFrame({
        'hotel_id': row['hotel_id'],
        'date': pd.date_range(start=row['date'], periods=row['length_of_stay']),
        'length_of_stay': row['length_of_stay'],
        'clicks': row['clicks']
    }) for ind, row in df.iterrows())

Full example:

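import io

import pandas as pd

data = '''\
hotel_id         date         length_of_stay     clicks
A               2019-01-01           3               7
B               2019-01-06           2               11
C               2019-01-03           1               4'''

# pd.compat.StringIO was removed from newer pandas releases; io.StringIO is the standard-library equivalent.
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, parse_dates=['date'], sep=r'\s+')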

newdf = pd.concat(pd.DataFrame({
    'hotel_id': row['hotel_id'],
    'date': pd.date_range(start=row['date'], periods=row['length_of_stay']),
    'length_of_stay': row['length_of_stay'],
    'clicks': row['clicks']
}) for ind, row in df.iterrows())

Returns:

   clicks       date hotel_id  length_of_stay
0       7 2019-01-01        A               3
1       7 2019-01-02        A               3
2       7 2019-01-03        A               3
0      11 2019-01-06        B               2
1      11 2019-01-07        B               2
0       4 2019-01-03        C               1
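Note that the index in this output restarts at 0 for every hotel, because each small frame keeps its own index when concatenated. Below is a minimal sketch of the same row-by-row idea that renumbers the rows with ignore_index=True and leaves out length_of_stay to match the requested layout. Using itertuples() instead of iterrows() is a substitution made here for illustration, not something the answer proposes.

```python
import io

import pandas as pd

data = """\
hotel_id date length_of_stay clicks
A 2019-01-01 3 7
B 2019-01-06 2 11
C 2019-01-03 1 4"""

df = pd.read_csv(io.StringIO(data), sep=r"\s+", parse_dates=["date"])

# One small frame per booking: the date range sets the length,
# and the scalar values are broadcast across it.
frames = [
    pd.DataFrame({
        "hotel_id": row.hotel_id,
        "date": pd.date_range(start=row.date, periods=row.length_of_stay),
        "clicks": row.clicks,
    })
    for row in df.itertuples(index=False)
]

# ignore_index=True replaces the per-frame indexes with a single 0..n-1 index.
newdf = pd.concat(frames, ignore_index=True)
print(newdf)
```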