我已经获得了一些我需要读入dask数据帧的大型sqlite表。这些表具有以日期时间(ISO格式的字符串)存储为sqlite NUMERIC数据类型的列。我能够使用Pandas'读取这种数据。 read_sql_table。但是,来自dask的同一调用会产生错误。有人可以建议一个好的解决方法吗? (我不知道将这些列的sqlite数据类型从NUMERIC更改为TEXT的简单方法。)我在下面粘贴了一个最小的例子。
import sqlalchemy
import pandas as pd
import dask.dataframe as ddf
connString = "sqlite:///c:\\temp\\test.db"
engine = sqlalchemy.create_engine(connString)
conn = engine.connect()
conn.execute("create table testtable (uid integer Primary Key, datetime NUM)")
conn.execute("insert into testtable values (1, '2017-08-03 01:11:31')")
print(conn.execute('PRAGMA table_info(testtable)').fetchall())
conn.close()
pandasDF = pd.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
pandasDF.head()
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
这是追溯:
Warning (from warnings module):
File "C:\Program Files\Python36\lib\site-packages\sqlalchemy\sql\sqltypes.py", line 596
'storage.' % (dialect.name, dialect.driver))
SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid', parse_dates={'datetime':'%Y-%m-%d %H:%M:%S'})
File "C:\Program Files\Python36\lib\site-packages\dask\dataframe\io\sql.py", line 98, in read_sql_table
head = pd.read_sql(q, engine, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 416, in read_sql
chunksize=chunksize)
File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 1104, in read_query
parse_dates=parse_dates)
File "C:\Program Files\Python36\lib\site-packages\pandas\io\sql.py", line 157, in _wrap_result
coerce_float=coerce_float)
File "C:\Program Files\Python36\lib\site-packages\pandas\core\frame.py", line 1142, in from_records
coerce_float=coerce_float)
File "C:\Program Files\Python36\lib\site-packages\pandas\core\frame.py", line 6304, in _to_arrays
data = lmap(tuple, data)
File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 129, in lmap
return list(map(*args, **kwargs))
TypeError: must be real number, not str
编辑:@mdurant的评论让我想知道这是否是sqlalchemy中的一个错误。以下代码提供与pandas相同的错误消息:
import sqlalchemy as sa
from sqlalchemy import text
m = sa.MetaData()
table = sa.Table('testtable', m, autoload=True, autoload_with=engine)
resultList = conn.execute(sa.sql.select(table.columns).select_from(table)).fetchall()
print(resultList)
resultList2 = conn.execute(sa.sql.select(columns=[text('uid'),text('datetime')], from_obj = text('testtable'))).fetchall()
print(resultList2)
Traceback (most recent call last):
File "<ipython-input-20-188c84a35d95>", line 1, in <module>
print(resultList)
File "c:\program files\python36\lib\site-packages\sqlalchemy\engine\result.py", line 156, in __repr__
return repr(sql_util._repr_row(self))
File "c:\program files\python36\lib\site-packages\sqlalchemy\sql\util.py", line 329, in __repr__
", ".join(trunc(value) for value in self.row),
TypeError: must be real number, not str
答案 0 :(得分:2)
令人费解。 以下是一些进一步的信息,希望可以得到答案。
正在相关行执行的查询是
pd.read_sql(sql.select(table.columns).select_from(table),
engine, index_col='uid')
在您显示时失败(此处limit
不相关)。
但是,同一查询的文本版本
sql.select(table.columns).select_from(table).compile().string
-> 'SELECT testtable.uid, testtable.datetime \nFROM testtable'
pd.read_sql('SELECT testtable.uid, testtable.datetime \nFROM testtable',
engine, index_col='uid') # works fine
以下解决方法(在查询中使用强制转换)确实有效(但不是很漂亮):
import sqlalchemy as sa
engine = sa.create_engine(connString)
table = sa.Table('testtable', m, autoload=True, autoload_with=engine)
uid, dt = list(table.columns)
q = sa.select([dt.cast(sa.types.String)]).select_from(table)
daskDF = ddf.read_sql_table(q, connString, index_col=uid.label('uid'))
-edit-
更简单的形式似乎也有效(见评论)
daskDF = ddf.read_sql_table('testtable', connString, index_col='uid',
columns=['uid', sa.sql.column('datetime').cast(sa.types.String).label('datetime')])