I am working with pandas. I have the following data:
useradClick.head(n=5)
Out[291]:
timestamp userId adCategory adCount
0 2016-05-26 15:13:22 611 electronics 1
1 2016-05-26 15:17:24 1874 movies 1
2 2016-05-26 15:22:52 2139 computers 1
3 2016-05-26 15:22:57 212 fashion 1
4 2016-05-26 15:22:58 1027 clothing 1
I want to convert 2016-05-26 15:13:22 to 2016-05-26 15, i.e. keep only the date and the hour. After that I want to do a group by on the result.
I tried
useradClickv1 = useradClick.select(pd.to_datetime('timestamp',format='%d%m%Y'))
but I got this error:
Traceback (most recent call last):
File "<ipython-input-292-9d5a6a59d577>", line 1, in <module>
useradClickv1 = useradClick.select(pd.to_datetime('timestamp',format='%d%m%Y'))
File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 287, in to_datetime
unit=unit, infer_datetime_format=infer_datetime_format)
File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 416, in _to_datetime
return _convert_listlike(np.array([arg]), box, format)[0]
File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 402, in _convert_listlike
raise e
File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 365, in _convert_listlike
arg, format, exact=exact, errors=errors)
File "pandas/tslib.pyx", line 3183, in pandas.tslib.array_strptime (pandas/tslib.c:55388)
ValueError: time data 'timestamp' does not match format '%d%m%Y' (match)
How can I do this conversion with pandas?
Edited 2016/07/07
I checked your answer and I get the error
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())
adclicksDF['adCount'] = 1
useradClick = adclicksDF[['timestamp','userId','adCategory','adCount']]
seradClick.timestamp = pd.to_datetime(useradClick.timestamp)
Traceback (most recent call last):
File "<ipython-input-31-ff9d4c4432ef>", line 1, in <module>
seradClick.timestamp = pd.to_datetime(useradClick.timestamp)
NameError: name 'seradClick' is not defined
useradClick.timestamp = pd.to_datetime(useradClick.timestamp)
/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2698: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
EDITED
I am using Anaconda with pandas 0.18.0.
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import sys
conf = (SparkConf()
.setMaster("local")
.setAppName("My app")
.set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())
adclicksDF['adCount'] = 1
useradClick = adclicksDF[['timestamp','userId','adCategory','adCount']]
useradClick.ix[:,'timestamp'] = p.to_datetime(useradClick.timestamp)
Traceback (most recent call last):
File "<ipython-input-21-dcc10ed41daa>", line 1, in <module>
useradClick.ix[:,'timestamp'] = p.to_datetime(useradClick.timestamp)
NameError: name 'p' is not defined
useradClick.ix[:,'timestamp'] = pd.to_datetime(useradClick.timestamp)
/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:461: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
Answer 0 (score: 1)
UPDATE:
cols = ['timestamp','userId','adCategory']
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
                         usecols=cols,
                         parse_dates=['timestamp'],
                         skipinitialspace=True).assign(adCount=1)
# .assign(adCount=1) replaces the separate step: adclicksDF['adCount'] = 1
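A self-contained sketch of the same idea, for trying it without the original file. The CSV text below is made up, since the real ad-clicks.csv is not shown in the question. It assumes that each comma is followed by a space (which the rename(... x.strip()) step in the question suggests), and the column named extra is purely hypothetical, there only to show what usecols filters out.

```python
import io
import pandas as pd

# Hypothetical stand-in for ad-clicks.csv; note the space after each comma.
csv_text = """timestamp, extra, userId, adCategory
2016-05-26 15:13:22, 1, 611, electronics
2016-05-26 15:17:24, 2, 1874, movies
2016-05-26 16:22:57, 3, 111, fashion
"""

cols = ['timestamp', 'userId', 'adCategory']
adclicksDF = pd.read_csv(io.StringIO(csv_text),
                         usecols=cols,                # keep only the needed columns
                         parse_dates=['timestamp'],   # parse to datetime while reading
                         skipinitialspace=True        # drop the space after each comma
                         ).assign(adCount=1)          # add the constant counter column

print(adclicksDF.dtypes)
```

Because the frame is built in one step rather than sliced from another frame and then modified, there is no later column assignment that could raise SettingWithCopyWarning.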
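Original answer:
If I have guessed correctly, you do not need to convert the datetime to a string as you describe.
If you want to group by hour, you can group on the datetime column directly, as in the session further down.
If your timestamp column is of object (string) dtype, you should first convert it to datetime: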
df.loc[: , 'timestamp'] = pd.to_datetime(df['timestamp'])
In [15]: df
Out[15]:
timestamp userId adCategory adCount
0 2016-05-26 15:13:22 611 electronics 1
1 2016-05-26 15:17:24 1874 movies 1
2 2016-05-26 15:22:52 2139 computers 1
3 2016-05-26 15:22:57 212 fashion 1
4 2016-05-26 15:22:58 1027 clothing 1
5 2016-05-26 16:22:57 111 fashion 1
6 2016-05-26 16:22:58 222 clothing 1
In [16]: df.groupby(pd.Grouper(key='timestamp', freq='1H'))['adCount'].agg(['count','sum'])
Out[16]:
count sum
timestamp
2016-05-26 15:00:00 5 5
2016-05-26 16:00:00 2 2
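If the literal hour label from the question (2016-05-26 15) is still wanted, one possible sketch follows. It assumes a small hand-built frame in place of the real data, takes an explicit .copy() of the column subset so that the later assignments act on an independent frame (which should avoid the SettingWithCopyWarning shown in the question), and formats the label with the .dt accessor. The names hour and per_hour are introduced here only for illustration.

```python
import pandas as pd

# Hand-built stand-in for the data shown above.
adclicksDF = pd.DataFrame({
    'timestamp': ['2016-05-26 15:13:22', '2016-05-26 15:17:24',
                  '2016-05-26 15:22:52', '2016-05-26 15:22:57',
                  '2016-05-26 15:22:58', '2016-05-26 16:22:57',
                  '2016-05-26 16:22:58'],
    'userId': [611, 1874, 2139, 212, 1027, 111, 222],
    'adCategory': ['electronics', 'movies', 'computers', 'fashion',
                   'clothing', 'fashion', 'clothing'],
    'adCount': 1,
})

# .copy() makes useradClick independent of adclicksDF.
useradClick = adclicksDF[['timestamp', 'userId', 'adCategory', 'adCount']].copy()
useradClick['timestamp'] = pd.to_datetime(useradClick['timestamp'])

# Build the "date plus hour" label as a string, then group on it.
useradClick['hour'] = useradClick['timestamp'].dt.strftime('%Y-%m-%d %H')
per_hour = useradClick.groupby('hour')['adCount'].sum()

print(per_hour)
```

The Grouper approach above keeps real datetimes in the index, which is usually easier to work with afterwards; the string label is mainly useful for display or export.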
Answer 1 (score: 0)
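pandas was told to expect the format '%d%m%Y' (day, month, year, with no separators). Your values look like 2016-05-26 15:13:22, which corresponds to '%Y-%m-%d %H:%M:%S'. Note also that the traceback shows the literal string 'timestamp' being parsed, so pass the column itself rather than its name. Try
useradClickv1 = pd.to_datetime(useradClick['timestamp'], format='%Y-%m-%d %H:%M:%S')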
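For reference, a minimal sketch of parsing with an explicit format, applied to the values themselves rather than to the column name. It uses the standard strptime directives (%Y four-digit year, %m month, %d day, %H 24-hour clock, %M minute, %S second); the two sample strings and the names raw, parsed and mismatch_raised are assumptions made for the example.

```python
import pandas as pd

# Sample strings in the same layout as the question's timestamp column.
raw = pd.Series(['2016-05-26 15:13:22', '2016-05-26 16:22:57'])

# The format matches the data, so parsing succeeds.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S')

# A format that does not match the data raises ValueError,
# which is the kind of error shown in the question.
try:
    pd.to_datetime(raw, format='%d%m%Y')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(parsed.dt.hour.tolist(), mismatch_raised)
```

For ISO-like strings such as these, the format argument can usually be omitted and pandas will infer it; spelling it out mainly helps when the layout is unusual or ambiguous.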