I have three columns (h1, h2, h3) that hold the day, month and year respectively, for example:
import pandas as pd
df = pd.DataFrame({
'h1': [1,2,3],
'h2': [1,2,3],
'h3': [2000,2001,2002]
})
When I run:
pd.to_datetime(df[['h1', 'h2', 'h3']])
it raises an error: ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
But when I rename the columns first and then call pd.to_datetime, for example:
df=df.rename(columns ={'h1':'day', 'h2':'month', 'h3': 'year'})
df["date_col"] =pd.to_datetime(df[['day','month','year']])
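I get the date column I want. Do we have to do it this way? Or is it possible to pass a format so that the columns are recognized as day, month and year respectively? Thanks.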
Answer 0: (score: 4)
Renaming the columns is already a sensible approach, because the documentation says:
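Examples
Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same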
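To illustrate the quoted behaviour, here is a minimal sketch that supplies the expected keys without renaming anything in df itself. It assumes the sample frame from the question and relies on pd.to_datetime accepting a dict-like as well as a DataFrame; the variable names dates and dates_renamed are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'h1': [1, 2, 3],           # day
    'h2': [1, 2, 3],           # month
    'h3': [2000, 2001, 2002],  # year
})

# Map the unit names pandas expects onto the existing columns.
# df keeps its original column names.
dates = pd.to_datetime({'year': df['h3'], 'month': df['h2'], 'day': df['h1']})

# Same result via rename on a temporary copy, as in the question.
dates_renamed = pd.to_datetime(
    df[['h1', 'h2', 'h3']].rename(columns={'h1': 'day', 'h2': 'month', 'h3': 'year'})
)

print(dates)
```

Either way the column names only act as labels that tell pandas which unit each column holds, so there is no separate format argument for this case.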
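But there are some alternatives. In my experience, a list comprehension with zip is pretty fast for small sets. At around 3000 rows of data, renaming the columns becomes the fastest. Looking at the graph, renaming carries a heavy penalty on small sets but pays off on large ones.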
pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])])
pd.to_datetime(['-'.join(i) for i in df[['h3', 'h2', 'h1']].values.astype(str)])
df[['h3','h2','h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), 1)
pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day', 'h2':'month','h3':'year'}))
#df = pd.concat([df]*1000)
2.74 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.08 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
158 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.64 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100 loops, best of 3: 6.1 ms per loop
100 loops, best of 3: 12.7 ms per loop
1 loop, best of 3: 335 ms per loop
100 loops, best of 3: 4.7 ms per loop
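Before comparing speed it is worth confirming that the four approaches produce the same dates. The following is a small sketch of such a check, assuming the three-row frame from the question; the labels in the results dict and the expected list are my own and not part of the original timings.

```python
import pandas as pd

df = pd.DataFrame({
    'h1': [1, 2, 3],
    'h2': [1, 2, 3],
    'h3': [2000, 2001, 2002],
})

# The four alternatives, keyed by a short illustrative label.
results = {
    'zip': pd.to_datetime(
        ['-'.join(map(str, i)) for i in zip(df['h3'], df['h2'], df['h1'])]
    ),
    'values': pd.to_datetime(
        ['-'.join(i) for i in df[['h3', 'h2', 'h1']].values.astype(str)]
    ),
    'apply': df[['h3', 'h2', 'h1']].astype(str).apply(
        lambda x: pd.to_datetime('-'.join(x)), axis=1
    ),
    'rename': pd.to_datetime(
        df[['h1', 'h2', 'h3']].rename(columns={'h1': 'day', 'h2': 'month', 'h3': 'year'})
    ),
}

expected = [pd.Timestamp(2000, 1, 1), pd.Timestamp(2001, 2, 2), pd.Timestamp(2002, 3, 3)]

# The first two return a DatetimeIndex and the last two a Series,
# so compare the values as plain lists of Timestamps.
for name, res in results.items():
    assert list(res) == expected, name
```

Note that the return types differ, which matters if the result is assigned back to a column of a frame with a non-default index.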
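Updated with the code I wrote (if you have suggestions for improvement, or know of any library that could help, I would be glad to hear them):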
import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict
df = pd.DataFrame({
'h1': np.arange(1,11),
'h2': np.arange(1,11),
'h3': np.arange(2000,2010)
})
myfuncs = {
"pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])])":
lambda: pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])]),
"pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)])":
lambda: pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)]),
"pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))":
lambda: pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))
}
d = defaultdict(dict)
step = 10
cont = True
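while cont:
    lendf = len(df); print(lendf)
    for k, v in myfuncs.items():  # was mycodes, which is never defined
        iters = 1
        t = 0
        # increase the number of calls until one batch takes at least 0.2 s
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        # iters was multiplied once more after the last measurement,
        # so the batch that produced t ran iters // 10 calls
        d[k][lendf] = t / (iters // 10)
        if t > 2: cont = False
    # grow the frame tenfold for the next round
    df = pd.concat([df]*step)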
pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()
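This produces a log-log plot of seconds per call against the number of DataFrame rows, with one line per method.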
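On the request for library help: the standard library's timeit.Timer.autorange (Python 3.6+) already does what the inner while loop does by hand. It raises the call count until a batch takes at least 0.2 seconds and returns both the count and the elapsed time. A minimal sketch follows; the helper name seconds_per_call is hypothetical, and only the rename variant is timed to keep it short.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'h1': [1, 2, 3],
    'h2': [1, 2, 3],
    'h3': [2000, 2001, 2002],
})


def seconds_per_call(func):
    """Return the average time of one call to func, in seconds."""
    # autorange picks the number of calls and returns (number, total_time)
    number, total = timeit.Timer(func).autorange()
    return total / number


per_call = seconds_per_call(
    lambda: pd.to_datetime(
        df[['h1', 'h2', 'h3']].rename(columns={'h1': 'day', 'h2': 'month', 'h3': 'year'})
    )
)
print(f"{per_call:.6f} seconds per call")
```

Because the call count and the elapsed time come back together, there is no separate counter to keep in sync with the measurement.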