Suppose we have the following data:
import pandas as pd
timestamp = ['2016-01-09_14-49-18','2016-01-10_09-48-59','2016-01-10_09-50-29','2016-01-10_09-59-08','2016-01-10_10-33-01','2016-01-10_10-35-01','2016-01-10_10-39-05','2016-01-10_10-40-38','2016-01-10_10-50-55','2016-01-10_12-28-35','2016-01-10_15-13-34','2016-01-10_17-02-44','2016-01-10_17-05-48','2016-01-10_17-13-44','2016-01-10_17-15-52']
feature = ['A','A','B','C','B','C','C','A','A','A','B','A','C','C','A']
df = pd.DataFrame({'timestamp':timestamp, 'feature':feature})
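How can I create a new column for each feature value that indicates whether that class occurred within, say, the last 15 minutes?
Desired result: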
feature timestamp A B C
0 A 2016-01-09_14-49-18 1 0 0
1 A 2016-01-10_09-48-59 1 0 0
2 B 2016-01-10_09-50-29 1 1 0
3 C 2016-01-10_09-59-08 1 1 1
4 B 2016-01-10_10-33-01 0 1 0
5 C 2016-01-10_10-35-01 0 1 1
6 C 2016-01-10_10-39-05 0 1 1
7 A 2016-01-10_10-40-38 1 1 1
8 A 2016-01-10_10-50-55 1 0 1
9 A 2016-01-10_12-28-35 1 0 0
10 B 2016-01-10_15-13-34 0 1 0
11 A 2016-01-10_17-02-44 1 0 0
12 C 2016-01-10_17-05-48 1 0 1
13 C 2016-01-10_17-13-44 1 0 1
14 A 2016-01-10_17-15-52 1 0 1
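Here 1 = the class occurred within the last 15 minutes, and 0 = it did not.
Answer 0: (score: 1)
Here is an example for one of the feature values. You can repeat it for the other columns/feature values in a loop: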
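df = df.set_index(pd.to_datetime(df['timestamp'], format='%Y-%m-%d_%H-%M-%S'))
# start from each row's own time, then blank out the rows that are not 'A'
df['A'] = df.index
df.loc[df['feature'] != 'A', 'A'] = pd.NaT
# carry the time of the most recent 'A' forward
df['A'] = df['A'].ffill()
# elapsed time since the most recent 'A'
df['A'] = df.index - df['A']
df['A'] = df['A'] < pd.to_timedelta('15min')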
This results in the following DataFrame:
feature timestamp A
timestamp
2016-01-09 14:49:18 A 2016-01-09_14-49-18 True
2016-01-10 09:48:59 A 2016-01-10_09-48-59 True
2016-01-10 09:50:29 B 2016-01-10_09-50-29 True
2016-01-10 09:59:08 C 2016-01-10_09-59-08 True
2016-01-10 10:33:01 B 2016-01-10_10-33-01 False
2016-01-10 10:35:01 C 2016-01-10_10-35-01 False
2016-01-10 10:39:05 C 2016-01-10_10-39-05 False
2016-01-10 10:40:38 A 2016-01-10_10-40-38 True
2016-01-10 10:50:55 A 2016-01-10_10-50-55 True
2016-01-10 12:28:35 A 2016-01-10_12-28-35 True
2016-01-10 15:13:34 B 2016-01-10_15-13-34 False
2016-01-10 17:02:44 A 2016-01-10_17-02-44 True
2016-01-10 17:05:48 C 2016-01-10_17-05-48 True
2016-01-10 17:13:44 C 2016-01-10_17-13-44 True
2016-01-10 17:15:52 A 2016-01-10_17-15-52 True
If you want 0 and 1 instead of bool, use astype(int) on the column.
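The loop mentioned above could be sketched as follows. `flag_recent` and its parameters are hypothetical names introduced for illustration, not part of the answer; the sketch assumes the `timestamp` strings use the format from the question, and it sorts the rows by time because the forward fill is only meaningful on chronologically ordered data.

```python
import pandas as pd


def flag_recent(df, features, window="15min"):
    """Hypothetical helper: add one 0/1 column per entry in `features`."""
    out = df.set_index(pd.to_datetime(df["timestamp"], format="%Y-%m-%d_%H-%M-%S"))
    # The forward fill below assumes rows are in chronological order.
    out = out.sort_index()
    times = out.index.to_series()
    for name in features:
        # Time of the most recent row whose feature equals `name` (NaT if none yet).
        last_seen = times.where((out["feature"] == name).to_numpy()).ffill()
        # Comparisons against NaT are False, so "never seen" becomes 0.
        out[name] = ((times - last_seen) < pd.Timedelta(window)).astype(int)
    return out


demo = pd.DataFrame({
    "timestamp": ["2016-01-10_09-48-59", "2016-01-10_09-50-29", "2016-01-10_10-33-01"],
    "feature": ["A", "B", "B"],
})
print(flag_recent(demo, ["A", "B"]))
```

Passing `df["feature"].unique()` as `features` would cover every value that appears in the data.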
Answer 1: (score: 1)
from datetime import timedelta
# prepare cols
df["A"] = 0
df["B"] = 0
df["C"] = 0
# convert to datetime
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d_%H-%M-%S")
feature_list = ["A", "B", "C"]
for curr_index, row in df.iterrows():
    # access by name: positional access depends on the column order
    curr_time = row["timestamp"]
    # rows in the 15 minutes up to and including the current row
    temp_df = df.loc[(df.timestamp <= curr_time) & (df.timestamp > curr_time - timedelta(minutes=15))]
    for feature_i in feature_list:
        if feature_i in temp_df.feature.tolist():
            df.loc[curr_index, feature_i] = 1
        else:
            df.loc[curr_index, feature_i] = 0
Output:
feature timestamp A B C
0 A 2016-01-09 14:49:18 1 0 0
1 A 2016-01-10 09:48:59 1 0 0
2 B 2016-01-10 09:50:29 1 1 0
3 C 2016-01-10 09:59:08 1 1 1
4 B 2016-01-10 10:33:01 0 1 0
5 C 2016-01-10 10:35:01 0 1 1
6 C 2016-01-10 10:39:05 0 1 1
7 A 2016-01-10 10:40:38 1 1 1
8 A 2016-01-10 10:50:55 1 0 1
9 A 2016-01-10 12:28:35 1 0 0
10 B 2016-01-10 15:13:34 0 1 0
11 A 2016-01-10 17:02:44 1 0 0
12 C 2016-01-10 17:05:48 1 0 1
13 C 2016-01-10 17:13:44 1 0 1
14 A 2016-01-10 17:15:52 1 0 1
Answer 2: (score: 1)
You can use:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d_%H-%M-%S')
for col in df['feature'].unique():
    # time elapsed since the most recent row whose feature equals col
    df[col] = df['timestamp'] - df['timestamp'].where(df['feature'] == col).ffill()
    df[col] = (df[col] < pd.to_timedelta('15min')).astype(int)
print(df)
feature timestamp A B C
0 A 2016-01-09 14:49:18 1 0 0
1 A 2016-01-10 09:48:59 1 0 0
2 B 2016-01-10 09:50:29 1 1 0
3 C 2016-01-10 09:59:08 1 1 1
4 B 2016-01-10 10:33:01 0 1 0
5 C 2016-01-10 10:35:01 0 1 1
6 C 2016-01-10 10:39:05 0 1 1
7 A 2016-01-10 10:40:38 1 1 1
8 A 2016-01-10 10:50:55 1 0 1
9 A 2016-01-10 12:28:35 1 0 0
10 B 2016-01-10 15:13:34 0 1 0
11 A 2016-01-10 17:02:44 1 0 0
12 C 2016-01-10 17:05:48 1 0 1
13 C 2016-01-10 17:13:44 1 0 1
14 A 2016-01-10 17:15:52 1 0 1
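For comparison, here is a sketch of a different technique that none of the answers above use: a time-based rolling window over indicator columns. It assumes the rows are already sorted by time (pandas requires a monotonic index for offset windows), and the names `dummies`, `flags` and `result` are illustrative only.

```python
import pandas as pd

timestamp = ['2016-01-09_14-49-18', '2016-01-10_09-48-59', '2016-01-10_09-50-29',
             '2016-01-10_09-59-08', '2016-01-10_10-33-01', '2016-01-10_10-35-01',
             '2016-01-10_10-39-05', '2016-01-10_10-40-38', '2016-01-10_10-50-55',
             '2016-01-10_12-28-35', '2016-01-10_15-13-34', '2016-01-10_17-02-44',
             '2016-01-10_17-05-48', '2016-01-10_17-13-44', '2016-01-10_17-15-52']
feature = ['A', 'A', 'B', 'C', 'B', 'C', 'C', 'A', 'A', 'A', 'B', 'A', 'C', 'C', 'A']
df = pd.DataFrame({'timestamp': timestamp, 'feature': feature})
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d_%H-%M-%S')

# One indicator column per feature value, indexed by time.
dummies = pd.get_dummies(df['feature'], dtype=float)
dummies.index = pd.DatetimeIndex(df['timestamp'])

# With the default closed='right', a '15min' window covers (t - 15min, t],
# so each row also sees itself.
flags = dummies.rolling('15min').max().astype(int)

# Put the flags back next to the original columns, aligned by row position.
result = df.join(flags.reset_index(drop=True))
print(result)
```

The window boundaries match the condition in Answer 1 (strictly later than 15 minutes before the current row, up to and including the current row), so on this data the flags should agree with the outputs shown above.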