How do I fix an error when saving a dask dataframe to csv?

Time: 2019-04-26 23:41:48

Tags: python pandas dataframe dask

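Whenever I try to save a dask dataframe to csv, I get an error message. In short, I have a pandas df with 10 columns and 20 rows, and I then load a dask df with 350 columns and 6+ million rows (~6 GB). I need to do a fairly simple left join against the pandas df. After the join, I use final.dtypes to check the data types of the resulting dask df, and it shows 12 columns, as I hoped. However, when I try to write the dask df named final to .csv, I get an error that points to columns in dask_df, even though those columns are not in the final table. What is going on, and how do I fix it? I can provide sample data if necessary.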

Error message:

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'Authorized Official Telephone Number': 'object',
       'Other Provider Identifier Issuer_33': 'object',
       'Other Provider Identifier Issuer_34': 'object',
       'Other Provider Identifier Issuer_35': 'object',
       'Other Provider Identifier Issuer_36': 'object',
       'Other Provider Identifier Issuer_37': 'object',
       'Other Provider Identifier Issuer_39': 'object',
       'Other Provider Identifier Issuer_40': 'object',
       'Other Provider Identifier Issuer_41': 'object',
       'Other Provider Identifier Issuer_42': 'object',
       'Other Provider Identifier Issuer_43': 'object',
       'Other Provider Identifier Issuer_44': 'object',
       'Other Provider Identifier Issuer_45': 'object',
       'Other Provider Identifier Issuer_46': 'object',
       'Other Provider Identifier Issuer_47': 'object',
       'Other Provider Identifier Issuer_48': 'object',
       'Other Provider Identifier Issuer_49': 'object',
       'Other Provider Identifier_37': 'object',
       'Other Provider Identifier_48': 'object',
       'Other Provider Identifier_49': 'object',
       'Provider Business Mailing Address Fax Number': 'object',
       'Provider Business Practice Location Address Fax Number': 'object'}

to the call to `read_csv`/`read_table`.

My code:

import dask.dataframe as dd
import pandas as pd

pandas_df = dd.read_csv('small_table.csv')

dask_df = dd.read_csv('npidata_pfile_20050523-20190407.csv',low_memory=False,dtype=str)

final= dd.merge(pandas_df, dask_df[['NPI','Provider First Name']], how='left', left_on='Physician NPI',right_on='NPI')

final.to_csv('e.csv')

2 answers:

Answer 0: (score: 1)

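You are passing dtype=str, but I think you should perhaps pass dtype=object instead, which is what Pandas uses to represent essentially any non-numeric data.

The dask.dataframe.read_csv function is giving you an error message that encourages you to use dtype=object. In fact, it gives you the complete dtype={...} directive, which you can pass to read_csv as-is to make things work.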

Answer 1: (score: 1)

If you really don't need any of those columns, you can simply exclude them when reading the file, by including only the columns you need.