Dask DataFrame has no attribute '_meta_nonempty' while merging large CSVs in Python

Time: 2016-11-30 03:07:30

Tags: python pandas dask

I tried pandas first:

import pandas as pd
df1 = pd.read_csv("csv1.csv")
df2 = pd.read_csv("csv2.csv")
my_keys = ["my_id", "my_subid"]
joined_df = pd.merge(df1, df2, on=my_keys)
joined_df.to_csv('out_df.csv', index=False)

After grinding away for a while, this fails with a MemoryError.
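
One way to lower peak memory without leaving pandas is to load only the smaller file in full and stream the larger one in chunks, merging each chunk and appending it to the output. This is a minimal sketch, assuming the smaller file fits in memory; `merge_in_chunks` and the default `chunksize` are hypothetical and not from the original post:

```python
import pandas as pd


def merge_in_chunks(small_csv, large_csv, out_csv, keys, chunksize=100_000):
    """Inner-join two CSVs on `keys`, streaming the large file chunk by chunk."""
    small = pd.read_csv(small_csv)  # assumption: this file fits in memory
    first = True
    for chunk in pd.read_csv(large_csv, chunksize=chunksize):
        merged = pd.merge(chunk, small, on=keys, how="inner")
        # Write the header with the first chunk only, then append.
        merged.to_csv(
            out_csv,
            mode="w" if first else "a",
            header=first,
            index=False,
        )
        first = False
```

Column dtypes are inferred separately for each chunk, so it may be necessary to pass `dtype=` to `pd.read_csv` for the key columns to keep them consistent across chunks.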

Next I tried Dask:

import dask.dataframe as dd

ddf1 = dd.read_csv("csv1.csv")
ddf2 = dd.read_csv("csv2.csv")
my_keys = ["my_id", "my_subid"]
joined_ddf = dd.merge(ddf1, ddf2, on=my_keys)
joined_ddf.to_csv('out_ddf.csv', index=False)

I got this rather cryptic error:

'DataFrame' object has no attribute '_meta_nonempty'

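Possibly this is the situation the doc mentions (my guess is that it comes from expensive type inference, or from something happening inside pandas). But after manually setting the metadata using the dtypes from pandas, and also trying from_pandas(), I got nowhere, so I think Dask may not be the best option here.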

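What should I try next? Short of some metadata trick, is it best to offload the join to an external database using sqlalchemy and df.to_sql? I am staying away from the plain csv module because the join involves multiple keys.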
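
For what it is worth, a multi-column key is manageable with the plain csv module if the key is treated as a tuple. The sketch below is a simple hash join under stated assumptions: the right-hand file fits in memory as a dict, non-key column names do not overlap between the two files, and keys are compared as raw strings (so "1" and "01" do not match). `hash_join_csv` is a made-up name:

```python
import csv


def hash_join_csv(left_csv, right_csv, out_csv, keys):
    """Inner-join two CSVs on a composite key using only the standard library."""
    # Index the (assumed smaller) right file by its tuple of key values.
    index = {}
    with open(right_csv, newline="") as f:
        reader = csv.DictReader(f)
        right_extra = [c for c in reader.fieldnames if c not in keys]
        for row in reader:
            index.setdefault(tuple(row[k] for k in keys), []).append(row)

    # Stream the left file and emit one output row per matching right row.
    with open(left_csv, newline="") as fin, open(out_csv, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        fieldnames = list(reader.fieldnames) + right_extra
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            for match in index.get(tuple(row[k] for k in keys), []):
                out = dict(row)
                out.update({c: match[c] for c in right_extra})
                writer.writerow(out)
```

Only the right-hand file is held in memory; the left-hand file and the output are processed one row at a time.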

1 answer:

Answer 0: (score: 0)

Follow-up: dumping to Postgres was fairly painless, although the DataFrame approach still looks cleaner to me.

import pandas as pd
from sqlalchemy import create_engine

# Load both CSVs into pandas (each file must still fit in memory).
df1 = pd.read_csv("csv1.csv")
df2 = pd.read_csv("csv2.csv")

# Push both tables to Postgres. Lowercase table names avoid having to
# double-quote mixed-case identifiers in the SQL below.
engine = create_engine('postgresql://user:passwd@localhost:5432/mydb')
df1.to_sql('table_one', engine)
df2.to_sql('table_two', engine)

# Let the database do the join on the composite key.
query = """
  SELECT *
  FROM table_one AS one
  INNER JOIN table_two AS two
  ON one.subject_id=two.subject_id
  AND one.subject_sub_id=two.subject_sub_id
  ORDER BY
  one.subject_id,
  one.subject_sub_id
  """
df_result = pd.read_sql_query(query, engine)
df_result.to_sql('result_table', engine)
df_result.to_csv("join_result.csv")
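
The version above still reads both CSVs fully into pandas before uploading them. A rough standard-library variant of the same idea streams each file into a database row by row and streams the joined result back out. SQLite stands in for Postgres here, and `load_csv`, `join_csvs_sqlite`, and the table names `one` and `two` are made up for illustration. The sketch also assumes the CSV header names are safe to use as column names:

```python
import csv
import sqlite3


def load_csv(conn, table, path):
    """Stream a CSV into a new SQLite table; values are stored as text."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        # executemany consumes the reader lazily, one row at a time.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


def join_csvs_sqlite(left_csv, right_csv, out_csv, keys, db_path=":memory:"):
    """Inner-join two CSVs on `keys` inside SQLite and write the result."""
    conn = sqlite3.connect(db_path)
    try:
        load_csv(conn, "one", left_csv)
        load_csv(conn, "two", right_csv)
        using = ", ".join(f'"{k}"' for k in keys)
        # JOIN ... USING returns each key column once in SELECT *.
        cur = conn.execute(
            f"SELECT * FROM one INNER JOIN two USING ({using}) ORDER BY {using}"
        )
        with open(out_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(d[0] for d in cur.description)
            writer.writerows(cur)  # rows are fetched as they are written
    finally:
        conn.close()
```

For data that does not fit in memory, `db_path` would need to point at a file on disk rather than `":memory:"`. Because the values are stored as text, the `ORDER BY` sorts them lexicographically, not numerically.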

I will have to give Dask another try in the future.