Create flags in new columns based on binned values in another column

Time: 2019-10-18 16:03:12

Tags: python-3.x pandas dataframe

I have a dataframe with a column time_bin that holds binned hours:

df= unique_id   time_bin
    s_001       2-3
    s_002       5-8
    s_003       3-6
    s_004       2-7
    s_005       5-9 

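I want to create a dataframe with one column per hourly range from 0 to 24 (0-1, 1-2, 2-3, ..., 23-24). For each row, the flag should be 1 in every column that falls within that row's time_bin range and 0 in all other columns. For example: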

new_df= unique_id   time_bin  0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 9-10.............. 23-24
        s_001       2-5    0  1    1   1   1   0   0   0   0   0  .................    0
        s_002       6-8    0  0    0   0   0   0   1   0   0   0  .................    0
        s_003       8-10   0  0    0   0   0   0   0   0   1   1  .................    0
        s_004       2-7    0  0    1   1   1   1   1   0   0   0  .................    0
        .....       ......
        .....       ......

3 Answers:

Answer 0 (score: 0)

Try this:

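import pandas as pd

df = pd.DataFrame({"unique_id": ["s_001", "s_002", "s_003", "s_004", "s_005"], "time_bin": ["2-3", "5-8", "3-6", "2-7", "5-9"]})

# Start with every hourly column set to 0.
for el in range(24):
    df[str(el) + "-" + str(el + 1)] = 0

# For each row, build a Series holding 1 for every hourly bin inside time_bin
# (half-open: start <= hour < end); bins that are missing become 0.
df2 = df["time_bin"].apply(lambda x: pd.Series({str(el) + "-" + str(el + 1): 1 for el in range(int(x.split("-")[0]), int(x.split("-")[1]))})).fillna(0).astype("int")
# Overwrite the matching zero columns with the computed flags.
df[df2.columns] = df2
print(df)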

Output:

  unique_id time_bin  0-1  1-2  ...  20-21  21-22  22-23  23-24
0     s_001      2-3    0    0  ...      0      0      0      0
1     s_002      5-8    0    0  ...      0      0      0      0
2     s_003      3-6    0    0  ...      0      0      0      0
3     s_004      2-7    0    0  ...      0      0      0      0
4     s_005      5-9    0    0  ...      0      0      0      0
[5 rows x 26 columns]

Answer 1 (score: 0)

This works fine:

import pandas as pd

df = pd.DataFrame({
    'unique_id': ['s_001', 's_002', 's_003', 's_004', 's_005'],
    'time_bin': ['2-3', '5-8', '3-6', '2-7', '5-9']
})

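def hour_in_interval(interval, hour):
    # Split on '-' so that multi-digit hours such as '10-12' parse correctly
    # (indexing interval[0] and interval[2] only works for single digits).
    first, last = (int(part) for part in interval.split('-'))
    # Half-open check: the bin covers hours first, ..., last - 1.
    if first <= hour < last:
        return 1
    else:
        return 0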

hours = pd.DataFrame(
    {'{}-{}'.format(i, i+1): df.time_bin.apply(hour_in_interval, hour=i) for i in range(24)}
)

df = pd.concat([df, hours], axis=1)
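
The same half-open test (start <= hour < end) can also be computed for all rows and all 24 hours at once. The following is only a sketch of that idea using NumPy broadcasting instead of apply; the names bounds, flags, labels and result are illustrative and do not come from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unique_id": ["s_001", "s_002", "s_003", "s_004", "s_005"],
    "time_bin": ["2-3", "5-8", "3-6", "2-7", "5-9"],
})

# Split "start-end" into two integer columns (labelled 0 and 1).
bounds = df["time_bin"].str.split("-", expand=True).astype(int)
start = bounds[0].to_numpy()[:, None]  # shape (n_rows, 1)
end = bounds[1].to_numpy()[:, None]    # shape (n_rows, 1)

hours = np.arange(24)                  # left edge of each hourly bin

# Broadcasting compares every row against every hour: shape (n_rows, 24).
# An hourly bin h-(h+1) is flagged when start <= h < end.
flags = ((start <= hours) & (hours < end)).astype(int)

labels = [f"{h}-{h + 1}" for h in hours]
result = pd.concat(
    [df, pd.DataFrame(flags, columns=labels, index=df.index)], axis=1
)
print(result)
```

With this half-open convention, time_bin 2-7 flags the five columns 2-3 through 6-7, which matches the s_004 row in the question.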

Answer 2 (score: 0)

You can do this with pd.arrays.IntervalArray and a list comprehension:

s = df.time_bin.str.split('-')
ia_bins = pd.arrays.IntervalArray.from_arrays(s.str[0].astype(int),
                                              s.str[1].astype(int), closed='both')
ia_cols = pd.arrays.IntervalArray.from_breaks(range(0,25), closed='both')
ia_arr = [ia_cols.overlaps(x).astype(int) for x in ia_bins]

new_df = df.join(pd.DataFrame(ia_arr, columns=ia_cols).rename(lambda x: f'{x.left}-{x.right}', axis=1))

  unique_id time_bin  0-1  1-2  2-3  3-4  4-5  5-6  6-7  7-8  8-9  9-10  \
0     s_001      2-3    0    1    1    1    0    0    0    0    0     0
1     s_002      5-8    0    0    0    0    1    1    1    1    1     0
2     s_003      3-6    0    0    1    1    1    1    1    0    0     0
3     s_004      2-7    0    1    1    1    1    1    1    1    0     0
4     s_005      5-9    0    0    0    0    1    1    1    1    1     1

   10-11  11-12  12-13  13-14  14-15  15-16  16-17  17-18  18-19  19-20  \
0      0      0      0      0      0      0      0      0      0      0
1      0      0      0      0      0      0      0      0      0      0
2      0      0      0      0      0      0      0      0      0      0
3      0      0      0      0      0      0      0      0      0      0
4      0      0      0      0      0      0      0      0      0      0

   20-21  21-22  22-23  23-24
0      0      0      0      0
1      0      0      0      0
2      0      0      0      0
3      0      0      0      0
4      0      0      0      0
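
In the output above, time_bin 2-3 also flags the neighbouring columns 1-2 and 3-4. That happens because closed='both' treats intervals that share only an endpoint as overlapping. If the half-open behaviour from the question's s_004 row is what you want (2-7 flags 2-3 through 6-7 only), one possible variant is to close the intervals on the left only. This is a sketch under that assumption, not part of the original answer; rows and labels are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "unique_id": ["s_001", "s_002", "s_003", "s_004", "s_005"],
    "time_bin": ["2-3", "5-8", "3-6", "2-7", "5-9"],
})

s = df["time_bin"].str.split("-")

# closed="left" makes every interval [start, end), so two intervals that
# merely touch at an endpoint no longer count as overlapping.
ia_bins = pd.arrays.IntervalArray.from_arrays(
    s.str[0].astype(int), s.str[1].astype(int), closed="left"
)
ia_cols = pd.arrays.IntervalArray.from_breaks(range(0, 25), closed="left")

# One 0/1 row per time_bin: which hourly intervals does it overlap?
rows = [ia_cols.overlaps(iv).astype(int) for iv in ia_bins]
labels = [f"{iv.left}-{iv.right}" for iv in ia_cols]

new_df = df.join(pd.DataFrame(rows, columns=labels, index=df.index))
print(new_df)
```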

Note: you can also use pd.IntervalIndex instead of pd.arrays.IntervalArray if you prefer.
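
For reference, a minimal sketch of the same approach written with pd.IntervalIndex. It keeps closed='both' as in the answer's code, so bins that only touch at an endpoint are flagged as well. The names idx_bins, idx_cols, rows and labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "unique_id": ["s_001", "s_002", "s_003", "s_004", "s_005"],
    "time_bin": ["2-3", "5-8", "3-6", "2-7", "5-9"],
})

s = df["time_bin"].str.split("-")

# IntervalIndex offers the same from_arrays / from_breaks constructors
# and the same overlaps method as IntervalArray.
idx_bins = pd.IntervalIndex.from_arrays(
    s.str[0].astype(int), s.str[1].astype(int), closed="both"
)
idx_cols = pd.IntervalIndex.from_breaks(range(0, 25), closed="both")

# Iterating an IntervalIndex yields pd.Interval objects.
rows = [idx_cols.overlaps(iv).astype(int) for iv in idx_bins]
labels = [f"{iv.left}-{iv.right}" for iv in idx_cols]

new_df = df.join(pd.DataFrame(rows, columns=labels, index=df.index))
print(new_df)
```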