I have two dataframes
dataframe1:
>df_case = dd.read_csv('s3://../.../df_case.csv')
>df_case.head(1)
sacc_id$ id$ creation_date
0 001A000000hwvV0IAI 5001200000ZnfUgAAJ 2016-06-07 14:38:02
dataframe2:
>df_limdata = dd.read_csv('s3://../.../df_limdata.csv')
>df_limdata.head(1)
sacc_id$ opp_line_id$ oppline_creation_date
0 001A000000hAUn8IAG a0W1200000G0i3UEAR 2015-06-10
First, I merged the two dataframes:
> case = dd.merge(df_limdata, df_case, left_on='sacc_id$',right_on='sacc_id$')
>case
Dask DataFrame Structure:
Unnamed: 0_x sacc_id$ opp_line_id$_x oppline_creation_date_x Unnamed: 0_y opp_line_id$_y oppline_creation_date_y
npartitions=5
int64 object object object int64 object object
... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
... ... ... ... ... ... ...
... ... ... ... ... ... ...
Dask Name: hash-join, 78 tasks
Then, I tried to convert this merged `case` dask dataframe into a pandas dataframe:
> # conversion to pandas
df = case.compute()
I got this error:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+------------+---------+----------+
| Column | Found | Expected |
+------------+---------+----------+
| Unnamed: 0 | float64 | int64 |
+------------+---------+----------+
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'Unnamed: 0': 'float64'}
to the call to `read_csv`/`read_table`.
Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.
Can you help me solve this problem?
Thanks
Answer 0 (score: 1)
When dask reads the file, it assumes int64 as the dtype of the 'Unnamed: 0' column, but later, when it actually computes, it finds that the column is float64.
Therefore, you need to specify the dtype when reading the file:
df_case = dd.read_csv('s3://../.../df_case.csv', dtype={'Unnamed: 0': 'float64'})
df_limdata = dd.read_csv('s3://../.../df_limdata.csv', dtype={'Unnamed: 0': 'float64'})
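Alternatively, as the error message itself suggests, you can pass assume_missing=True so that every integer column without an explicit dtype is read as float64. Below is a minimal sketch of the full flow under that option; the truncated S3 paths are just the placeholders from the question:

import dask.dataframe as dd

# assume_missing=True tells dask to treat unspecified integer columns as floats,
# so partitions containing NaN in 'Unnamed: 0' no longer contradict the inferred schema
df_case = dd.read_csv('s3://../.../df_case.csv', assume_missing=True)
df_limdata = dd.read_csv('s3://../.../df_limdata.csv', assume_missing=True)

case = dd.merge(df_limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
df = case.compute()  # materializes the merge as a pandas DataFrame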