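I am trying to apply a function that truncates timestamps to a single column. The function runs, but it returns a list. I would like to keep the data structure.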
import pandas as pd

df = pd.DataFrame({
    'Time': ['8:03:001','8:17:004','8:20:003','8:28:002','8:35:004','8:40:006','8:42:002','8:45:004','8:50:009'],
    'Place': ['House 1','House 1','House 1','House 2','House 2','House 2','House 3','House 3','House 3'],
})

def truncate_time(col):
    # Drop the last two characters of each value
    col = [x[:-2] for x in col]
    return col

df1 = truncate_time(df['Time'])
Expected output:
     Time    Place
0  8:03:0  House 1
1  8:17:0  House 1
2  8:20:0  House 1
3  8:28:0  House 2
4  8:35:0  House 2
5  8:40:0  House 2
6  8:42:0  House 3
7  8:45:0  House 3
8  8:50:0  House 3
Answer 0 (score: 4)
You can assign the result back to the column:
df['Time'] = truncate_time(df['Time'])
print(df)
     Time    Place
0  8:03:0  House 1
1  8:17:0  House 1
2  8:20:0  House 1
3  8:28:0  House 2
4  8:35:0  House 2
5  8:40:0  House 2
6  8:42:0  House 3
7  8:45:0  House 3
8  8:50:0  House 3
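To see why the assignment keeps the structure: the function returns a plain Python list, and assigning a list of matching length to df['Time'] overwrites only that column's values. A minimal sketch, using a three-row frame that is shortened from the question's data purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['8:03:001', '8:17:004', '8:20:003'],
    'Place': ['House 1', 'House 1', 'House 2'],
})

def truncate_time(col):
    # Drop the last two characters of every value; the result is a plain list
    return [x[:-2] for x in col]

result = truncate_time(df['Time'])
print(type(result))  # a list, not a Series

# Assigning the list back writes it into the existing column,
# so the DataFrame keeps its index and its other columns
df['Time'] = result
print(df)
```

The list must have the same length as the DataFrame, otherwise the assignment raises an error.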
However, you can also use the str accessor with indexing here:
df['Time'] = df['Time'].str[:-2]
Or a lambda function:
df['Time'] = df['Time'].apply(lambda col: col[:-2])
Or simplify the solution by removing the list comprehension from the function and applying it with Series.apply:
def truncate_time(col):
    return col[:-2]

df['Time'] = df['Time'].apply(truncate_time)
A final solution with a list comprehension:
df['Time'] = [x[:-2] for x in df['Time']]
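Of the alternatives above, the str accessor is the only one that tolerates missing values as written: it passes NaN through, while slicing a NaN directly raises an error. A small sketch, assuming a column that contains one np.nan (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(['8:03:001', np.nan, '8:20:003'])

# .str slicing skips the missing value and leaves it as NaN
truncated = s.str[:-2]
print(truncated)

# Slicing the float NaN directly is not possible and raises TypeError
try:
    [x[:-2] for x in s]
except TypeError as exc:
    print('list comprehension failed:', exc)
```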
EDIT: Performance when missing values are possible. It depends on the number of values and also on the number of missing values:
import numpy as np

# Added one row with a missing value
df = pd.DataFrame({
    'Time': ['8:03:001','8:17:004','8:20:003','8:28:002','8:35:004','8:40:006','8:42:002','8:45:004','8:50:009',np.nan],
    'Place': ['House 1','House 1','House 1','House 2','House 2','House 2','House 3','House 3','House 3','House 3'],
})

def truncate_time(col):
    # NaN is not equal to itself, so missing values are returned unchanged
    return col[:-2] if col == col else col

# [1000000 rows x 2 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [104]: %timeit df['Time1'] = df['Time'].str[:-2]
460 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [105]: %timeit df['Time2'] = [x[:-2] if x == x else x for x in df['Time']]
445 ms ± 9.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [106]: %timeit df['Time3'] = df['Time'].apply(lambda col: col[:-2] if col == col else col)
428 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [107]: %timeit df['Time4'] = df['Time'].apply(truncate_time)
416 ms ± 8.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
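The x == x test in these snippets works because NaN is the only value that does not compare equal to itself, so the slice is applied only to non-missing entries. The %timeit magic is specific to IPython; a rough sketch of a similar comparison with the standard timeit module follows. The smaller frame, the row count, and the repeat count are arbitrary choices made here, and timings will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Time': ['8:03:001', '8:17:004', np.nan]})
df = pd.concat([df] * 10000, ignore_index=True)

def truncate_time(val):
    # NaN != NaN, so val == val is False only for missing values
    return val[:-2] if val == val else val

candidates = {
    'str accessor': lambda: df['Time'].str[:-2],
    'list comprehension': lambda: [x[:-2] if x == x else x for x in df['Time']],
    'apply': lambda: df['Time'].apply(truncate_time),
}

runs = 5
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=runs)
    print(f'{name}: {seconds / runs:.4f} s per run')
```

Differences as small as those shown above are within the reported standard deviation, so it may be worth repeating the measurement on your own data before choosing an approach for speed alone.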