I am using Cloudera VM 5.2 and pandas 0.18.0.
I have the following data:
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
parse_dates=['timestamp'],
skipinitialspace=True).assign(adCount=1)
adclicksDF.head(n=5)
Out[107]:
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 15:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 15:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
The data types of the fields are:
for col in adclicksDF:
    print(col)
    print(type(adclicksDF[col][1]))
timestamp
<class 'pandas.tslib.Timestamp'>
txId
<class 'numpy.int64'>
userSessionId
<class 'numpy.int64'>
teamId
<class 'numpy.int64'>
userId
<class 'numpy.int64'>
adId
<class 'numpy.int64'>
adCategory
<class 'str'>
adCount
<class 'numpy.int64'>
I want to truncate the minutes and seconds in the timestamp.
I tried:
adclicksDF["timestamp"] = pd.to_datetime(adclicksDF["timestamp"],format='%Y-%m-%d %H')
adclicksDF.head(n=5)
Out[110]:
timestamp txId userSessionId teamId userId adId adCategory \
0 2016-05-26 15:13:22 5974 5809 27 611 2 electronics
1 2016-05-26 15:17:24 5976 5705 18 1874 21 movies
2 2016-05-26 15:22:52 5978 5791 53 2139 25 computers
3 2016-05-26 15:22:57 5973 5756 63 212 10 fashion
4 2016-05-26 15:22:58 5980 5920 9 1027 20 clothing
adCount
0 1
1 1
2 1
3 1
4 1
This does not truncate the minutes and seconds.
How can I truncate the minutes and seconds?
Answer 0 (score: 3)
You can use:
# Parentheses let the method chain continue onto the next line
adclicksDF["timestamp"] = (pd.to_datetime(adclicksDF["timestamp"])
                             .apply(lambda x: x.replace(minute=0, second=0)))
print (adclicksDF)
timestamp txId userSessionId teamId userId adId adCategory
0 2016-05-26 15:00:00 5974 5809 27 611 2 electronics
1 2016-05-26 15:00:00 5976 5705 18 1874 21 movies
2 2016-05-26 15:00:00 5978 5791 53 2139 25 computers
3 2016-05-26 15:00:00 5973 5756 63 212 10 fashion
4 2016-05-26 15:00:00 5980 5920 9 1027 20 clothing
print (type(adclicksDF.ix[0, 'timestamp']))
<class 'pandas.tslib.Timestamp'>
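One caveat with the replace approach: only the fields named in the call are changed. The sketch below uses a hypothetical timestamp with a fractional second (the question's data has none) to show that the microseconds survive unless they are zeroed as well.

```python
import pandas as pd

# Hypothetical value with a fractional second; not taken from the question's data.
ts = pd.Timestamp("2016-05-26 15:13:22.123456")

# Only minute and second are reset, so the microseconds are kept.
partial = ts.replace(minute=0, second=0)

# Resetting the sub-second fields too gives an exact start-of-hour value.
full = ts.replace(minute=0, second=0, microsecond=0, nanosecond=0)

print(partial)
print(full)
```

If the timestamps can carry sub-second precision, the second form is probably the one you want.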
If you need string output, use dt.strftime:
adclicksDF["timestamp"] = pd.to_datetime(adclicksDF["timestamp"]).dt.strftime('%Y-%m-%d %H')
print (adclicksDF)
timestamp txId userSessionId teamId userId adId adCategory
0 2016-05-26 15 5974 5809 27 611 2 electronics
1 2016-05-26 15 5976 5705 18 1874 21 movies
2 2016-05-26 15 5978 5791 53 2139 25 computers
3 2016-05-26 15 5973 5756 63 212 10 fashion
4 2016-05-26 15 5980 5920 9 1027 20 clothing
print (type(adclicksDF.ix[0, 'timestamp']))
<class 'str'>
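As far as I can tell, this also explains why the attempt in the question changed nothing: the format argument of pd.to_datetime only controls how strings are parsed, and a column that already holds timestamps is returned as it is. A minimal sketch with made-up sample values, showing formatting to hour-only strings and then parsing them back:

```python
import pandas as pd

# Made-up sample values in the same layout as the question's timestamps.
ts = pd.Series(pd.to_datetime(["2016-05-26 15:13:22", "2016-05-26 16:01:05"]))

# Formatting drops the minutes and seconds, but the result is plain text.
as_text = ts.dt.strftime("%Y-%m-%d %H")

# Parsing the hour-only strings with the same format yields timestamps again,
# now with the minutes and seconds set to zero.
back = pd.to_datetime(as_text, format="%Y-%m-%d %H")

print(as_text.tolist())
print(back.tolist())
```

The round trip works, but it goes through strings, so it is likely slower than operating on the timestamps directly.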
Edit:
A better solution is to use dt.floor, as in Alex's answer.
Answer 1 (score: 2)
pd.Timestamp has had a floor method, which rounds down to a given resolution, since 0.18:
adclicksDF["timestamp"] = adclicksDF.timestamp.dt.floor('h')
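For a self-contained illustration, here is a small sketch on hypothetical data that mirrors the first rows of the question's frame. The column name hour and the groupby step are assumptions added for the example, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the question's data.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-26 15:13:22",
        "2016-05-26 15:17:24",
        "2016-05-26 15:22:52",
    ]),
    "txId": [5974, 5976, 5978],
})

# Round every timestamp down to the start of its hour.
df["hour"] = df["timestamp"].dt.floor("h")

# The floored column is still a datetime column, so it can serve as a group key.
clicks_per_hour = df.groupby("hour")["txId"].count()
print(clicks_per_hour)
```

Because dt.floor is vectorized and keeps the datetime dtype, it avoids both the per-row apply and the detour through strings.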
Answer 2 (score: 0)