ValueError: Cannot cast DatetimeIndex to dtype datetime64[us]

Asked: 2016-07-22 00:08:31

Tags: python postgresql pandas

I am trying to create a PostgreSQL table of 30-minute data for the S&P 500 ETF (spy30new, used to test newly inserted data) from a table of 15-minute data for several stocks (all15). all15 has an index on 'dt' (a timestamp) and 'instr' (the stock symbol). I want spy30new to have an index on 'dt'.

import numpy as np
import pandas as pd
from datetime import datetime, date, time, timedelta
from dateutil import parser
from sqlalchemy import create_engine

# Query all15
engine = create_engine('postgresql://user:passwd@localhost:5432/stocks')
new15Df = (pd.read_sql_query("SELECT dt, o, h, l, c, v FROM all15 WHERE (instr = 'SPY') AND (date(dt) BETWEEN '2016-06-27' AND '2016-07-15');", engine)).sort_values('dt')
# Correct for Time Zone.
new15Df['dt'] = (new15Df['dt'].copy()).apply(lambda d: d + timedelta(hours=-4))

# spy0030Df contains the 15-minute data at 00 & 30 minute time points
# spy1545Df contains the 15-minute data at 15 & 45 minute time points
spy0030Df = (new15Df[new15Df['dt'].apply(lambda d: d.minute % 30) == 0]).reset_index(drop=True)
spy1545Df = (new15Df[new15Df['dt'].apply(lambda d: d.minute % 30) == 15]).reset_index(drop=True)

high = pd.concat([spy1545Df['h'], spy0030Df['h']], axis=1).max(axis=1)
low = pd.concat([spy1545Df['l'], spy0030Df['l']], axis=1).min(axis=1)
volume = spy1545Df['v'] + spy0030Df['v']

# spy30Df assembled and pushed to PostgreSQL as table spy30new
spy30Df = pd.concat([spy0030Df['dt'], spy1545Df['o'], high, low, spy0030Df['c'], volume], ignore_index = True, axis=1)
spy30Df.columns = ['dt', 'o', 'h', 'l', 'c', 'v']  # 'dt' must match the set_index call below
spy30Df.set_index(['dt'], inplace=True)
spy30Df.to_sql('spy30new', engine, if_exists='append', index_label='dt')

This gives the error "ValueError: Cannot cast DatetimeIndex to dtype datetime64[us]".
What I have tried so far (I have successfully pushed CSV files to PG with pandas, but here the source is a PG database):

  1. Not setting an index on 'dt'

    spy30Df.set_index(['dt'], inplace=True)  # Remove this line
    spy30Df.to_sql('spy30new', engine, if_exists='append')  # Delete the index_label option
    
  2. Converting 'dt' from pandas.tslib.Timestamp to datetime.datetime with to_pydatetime() (in case psycopg2 can handle a Python datetime but not a pandas Timestamp)

    u = (spy0030Df['dt']).tolist()
    timesAsPyDt = np.asarray([d.to_pydatetime() for d in u])  # list comprehension so NumPy gets a concrete sequence
    spy30Df = pd.concat([spy1545Df['o'], high, low, spy0030Df['c'], volume], ignore_index = True, axis=1)
    newArray = np.c_[timesAsPyDt, spy30Df.values]
    colNames = ['dt', 'o', 'h', 'l', 'c', 'v']
    newDf = pd.DataFrame(newArray, columns=colNames)
    newDf.set_index(['dt'], inplace=True)
    newDf.to_sql('spy30new', engine, if_exists='append', index_label='dt')
    
  3. Using datetime.utcfromtimestamp()

    timesAsDt = (spy0030Df['dt']).apply(lambda d: datetime.utcfromtimestamp(d.tolist()/1e9))
    
  4. Using pd.to_datetime()

    timesAsDt = pd.to_datetime(spy0030Df['dt'])
    

3 Answers:

Answer 0 (score: 7)

Using pd.to_datetime() on each element worked. Option 4, which applies pd.to_datetime() to the entire Series, does not. Perhaps the Postgres driver understands a Python datetime but not numpy's datetime64, which pandas uses. Option 4 produced the correct output, but I got the ValueError (see the title) when sending the DF to Postgres.
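
A minimal sketch of the element-wise conversion (the answer itself shows no code; spy30Df and engine are assumed from the question, taken before its set_index call):

spy30Df['dt'] = spy30Df['dt'].apply(lambda d: pd.to_datetime(str(d)))  # convert each element individually
spy30Df.to_sql('spy30new', engine, if_exists='append', index=False)    # 'dt' stays an ordinary column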

Answer 1 (score: 4)

For context, this is my dataframe:

                              Biomass  Fossil Brown coal/Lignite  Fossil Coal-derived gas  Fossil Gas  Fossil Hard coal  Fossil Oil  Geothermal  Hydro Pumped Storage  Hydro Run-of-river and poundage  Hydro Water Reservoir  Nuclear   Other  Other renewable    Solar  Waste  Wind Offshore  Wind Onshore
2018-02-02 00:00:00+01:00   4835.0                    16275.0                    446.0      1013.0            4071.0       155.0         5.0                   7.0                           1906.0                   35.0   8924.0  3643.0            142.0      0.0  595.0         2517.0       19999.0
2018-02-02 00:15:00+01:00   4834.0                    16272.0                    446.0      1010.0            3983.0       155.0         5.0                   7.0                           1908.0                   71.0   8996.0  3878.0            142.0      0.0  594.0         2364.0       19854.0
2018-02-02 00:30:00+01:00   4828.0                    16393.0                    446.0      1019.0            4015.0       155.0         5.0    

I was trying to insert it into a SQL database and ran into the same error as the question above. What I did was convert the dataframe's index into a column labeled 'index':

df.reset_index(level=0, inplace=True)  

Then rename the column 'index' to 'DateTime':

df = df.rename(columns={'index': 'DateTime'})

Change its data type to 'datetime64':

df['DateTime'] = df['DateTime'].astype('datetime64')
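
Note that recent pandas versions refuse to cast a timezone-aware column with astype('datetime64'), and the index shown above carries a +01:00 offset. A sketch of the alternative in that case (not part of the original answer):

df['DateTime'] = pd.to_datetime(df['DateTime']).dt.tz_localize(None)  # drop the UTC offset, keeping local wall time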

Finally, store it in the SQL database with:

engine = create_engine('mysql+mysqlconnector://root:Password@localhost/generation_data', echo=True)
df.to_sql(con=engine, name='test', if_exists='replace')

Answer 2 (score: 3)

I ran into the same problem. Applying pd.to_datetime() to each element works, but it is orders of magnitude slower than running pd.to_datetime() on the entire Series. For a dataframe with more than 1 million rows:

(df['Time']).apply(lambda d: pd.to_datetime(str(d)))

takes about 70 seconds, while

pd.to_datetime(df['Time'])

takes about 0.01 seconds.

The actual problem was that timezone information was included. To remove it:

t = pd.to_datetime(df['Time'])
t = t.dt.tz_localize(None)  # .dt so the values are localized, not the index

This should be much faster!
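
To put the fix to use, a short sketch (the table name and engine are assumptions, not part of the answer):

df['Time'] = t  # assign the tz-naive timestamps back
df.to_sql('my_table', engine, if_exists='append', index=False)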