pandas timestamp conversion

Time: 2016-07-07 17:39:03

Tags: python pandas timestamp data-conversion

I am working with pandas and have the following data:

useradClick.head(n=5)
Out[291]: 
             timestamp  userId   adCategory  adCount
0  2016-05-26 15:13:22     611  electronics        1
1  2016-05-26 15:17:24    1874       movies        1
2  2016-05-26 15:22:52    2139    computers        1
3  2016-05-26 15:22:57     212      fashion        1
4  2016-05-26 15:22:58    1027     clothing        1

I want to convert 2016-05-26 15:13:22 to 2016-05-26 15, keeping only the date and the hour. After that I want to do a group by.

I tried

useradClickv1 = useradClick.select(pd.to_datetime('timestamp',format='%d%m%Y'))

but I got this error:

Traceback (most recent call last):

  File "<ipython-input-292-9d5a6a59d577>", line 1, in <module>
    useradClickv1 = useradClick.select(pd.to_datetime('timestamp',format='%d%m%Y'))

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
    return func(*args, **kwargs)

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 287, in to_datetime
    unit=unit, infer_datetime_format=infer_datetime_format)

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 416, in _to_datetime
    return _convert_listlike(np.array([arg]), box, format)[0]

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 402, in _convert_listlike
    raise e

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 365, in _convert_listlike
    arg, format, exact=exact, errors=errors)

  File "pandas/tslib.pyx", line 3183, in pandas.tslib.array_strptime (pandas/tslib.c:55388)

**ValueError: time data 'timestamp' does not match format '%d%m%Y' (match)**

How can I do this conversion with pandas?

Edited 2016/07/07

I tried your answer and got the following error and warning:
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv')

adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())

adclicksDF['adCount'] = 1

useradClick = adclicksDF[['timestamp','userId','adCategory','adCount']]

seradClick.timestamp = pd.to_datetime(useradClick.timestamp)
Traceback (most recent call last):

  File "<ipython-input-31-ff9d4c4432ef>", line 1, in <module>
    seradClick.timestamp = pd.to_datetime(useradClick.timestamp)

NameError: name 'seradClick' is not defined


useradClick.timestamp = pd.to_datetime(useradClick.timestamp)
/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2698: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

EDITED

I am using Anaconda with pandas 0.18.0.

import pandas as pd

from pyspark.mllib.clustering import KMeans, KMeansModel

from numpy import array

from pyspark import SparkConf, SparkContext

from pyspark.sql import SQLContext

import sys

conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))

sc          = SparkContext(conf = conf)


sqlContext  = SQLContext(sc)

adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv')

adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())

adclicksDF['adCount'] = 1 

useradClick = adclicksDF[['timestamp','userId','adCategory','adCount']]

useradClick.ix[:,'timestamp'] = p.to_datetime(useradClick.timestamp)
Traceback (most recent call last):

  File "<ipython-input-21-dcc10ed41daa>", line 1, in <module>
    useradClick.ix[:,'timestamp'] = p.to_datetime(useradClick.timestamp)

NameError: name 'p' is not defined


useradClick.ix[:,'timestamp'] = pd.to_datetime(useradClick.timestamp)
/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:461: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

2 answers:

Answer 0 (score: 1)

UPDATE:

cols = ['timestamp','userId','adCategory']
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
                         usecols=cols,
                         parse_dates=['timestamp'],
                         skipinitialspace=True).assign(adCount=1)
#adclicksDF['adCount'] = 1
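
A minimal sketch of what those read_csv options do, run on a small in-memory CSV instead of the real file. The sample rows, the extra `note` column, and the variable name `csv_text` are assumptions made up for illustration; only `usecols`, `parse_dates`, `skipinitialspace`, and `assign` come from the snippet above.

```python
import io

import pandas as pd

# Hypothetical stand-in for ad-clicks.csv. The spaces after the commas mimic
# the padded headers that the question strips with rename(...).
csv_text = """timestamp, userId, adCategory, note
2016-05-26 15:13:22, 611, electronics, a
2016-05-26 15:17:24, 1874, movies, b
"""

cols = ['timestamp', 'userId', 'adCategory']
adclicksDF = pd.read_csv(io.StringIO(csv_text),
                         usecols=cols,               # keep only these columns
                         parse_dates=['timestamp'],  # parse to datetime while reading
                         skipinitialspace=True       # drop the blank after each comma
                         ).assign(adCount=1)         # add a constant column

print(adclicksDF.dtypes)
```

Because the timestamp column is parsed while reading and the frame is built in one step, no later assignment into a column subset is needed.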

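Original answer:

If I understood you correctly, you don't need to convert the datetime to a string as you described.

If you want to group by hour:

If your timestamp column has the object (string) dtype, first convert it to datetime: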

df['timestamp'] = pd.to_datetime(df['timestamp'])
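
When the frame is a column subset of another frame, as `useradClick` is in the question, this assignment can raise SettingWithCopyWarning. One common way around it, sketched here on made-up rows, is to take an explicit copy before assigning. The names `adclicks`, `user_ad_click`, and the `unused` column are hypothetical.

```python
import pandas as pd

# Hypothetical source frame with one column we do not need.
adclicks = pd.DataFrame({
    "timestamp": ["2016-05-26 15:13:22", "2016-05-26 15:17:24"],
    "userId": [611, 1874],
    "adCategory": ["electronics", "movies"],
    "adCount": [1, 1],
    "unused": ["a", "b"],
})

# .copy() makes the subset an independent frame, so the assignment below
# clearly targets the subset and not a slice of adclicks.
user_ad_click = adclicks[["timestamp", "userId", "adCategory", "adCount"]].copy()
user_ad_click["timestamp"] = pd.to_datetime(user_ad_click["timestamp"])

print(user_ad_click.dtypes)
```

The original `adclicks` frame keeps its string timestamps; only the copy is converted.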

In [15]: df
Out[15]:
            timestamp  userId   adCategory  adCount
0 2016-05-26 15:13:22     611  electronics        1
1 2016-05-26 15:17:24    1874       movies        1
2 2016-05-26 15:22:52    2139    computers        1
3 2016-05-26 15:22:57     212      fashion        1
4 2016-05-26 15:22:58    1027     clothing        1
5 2016-05-26 16:22:57     111      fashion        1
6 2016-05-26 16:22:58     222     clothing        1

In [16]: df.groupby(pd.Grouper(key='timestamp', freq='1H'))['adCount'].agg(['count','sum'])
Out[16]:
                     count  sum
timestamp
2016-05-26 15:00:00      5    5
2016-05-26 16:00:00      2    2
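
If a label such as "2016-05-26 15" is really wanted, as the question asks, one possible alternative is to format the datetime column down to the hour and group on that string. This is only a sketch on invented rows; the `hour` column and the `per_hour` name are assumptions for illustration. Note also that recent pandas releases prefer the lowercase alias 'h' over 'H' for hourly frequencies.

```python
import pandas as pd

# Hypothetical sample with clicks in two different hours.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-26 15:13:22",
        "2016-05-26 15:17:24",
        "2016-05-26 16:22:57",
    ]),
    "adCount": [1, 1, 1],
})

# Keep only year-month-day and hour as a string label, e.g. "2016-05-26 15".
df["hour"] = df["timestamp"].dt.strftime("%Y-%m-%d %H")

# Sum the click counts per hour label.
per_hour = df.groupby("hour")["adCount"].sum()
print(per_hour)
```

Unlike the Grouper approach, this only produces rows for hours that actually occur in the data, and the group keys are plain strings.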

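Answer 1 (score: 0)

With format='%d%m%Y', pandas expects day, month, and year with no separators between them. Your values look like 2016-05-26 00:00:00, which corresponds to '%Y-%m-%d %H:%M:%S'. Also, pass the column itself rather than the literal string 'timestamp'; the error shows pandas trying to parse the word 'timestamp' as a date. Try

useradClickv1 = useradClick.assign(timestamp=pd.to_datetime(useradClick['timestamp'], format='%Y-%m-%d %H:%M:%S'))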
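
A small sketch of the two points at play here, assuming a hypothetical Series named `stamps`: the strptime directives are case-sensitive (%Y is a four-digit year, %H the 24-hour hour, %M minutes, %S seconds), and `pd.to_datetime` must receive the data, not a column name.

```python
import pandas as pd

# Hypothetical values in the same layout as the question's timestamp column.
stamps = pd.Series(["2016-05-26 15:13:22", "2016-05-26 16:22:57"])

# An explicit format matching "YYYY-MM-DD HH:MM:SS".
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")
print(parsed.dt.hour.tolist())

# Passing the literal string 'timestamp' reproduces the question's failure:
# pandas tries to parse the word itself as a date.
try:
    pd.to_datetime("timestamp", format="%d%m%Y")
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```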