Dask equivalent of pd.to_numeric

Time: 2019-06-26 10:58:30

Tags: dask dask-distributed

I am trying to read multiple CSV files, each around 15 GB, with dask's read_csv. While doing this, dask infers a particular column as float, but the column contains some string values, so it fails later when I try to run an operation on it, complaining that a string cannot be converted to float. For that reason I read all columns as strings with the dtype=str parameter. Now I want to convert a specific column to numeric with errors='coerce', so that the records containing strings become NaN and the rest become proper floats. Can you tell me how to achieve this with dask?

Already tried: type casting

import dask.dataframe as dd

df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)
search_df = df.query('mycol > 0').compute()
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------------------+--------+----------+
| Column                            | Found  | Expected |
+-----------------------------------+--------+----------+
| mycol                             | object | float64  |
+-----------------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- mycol
  ValueError("could not convert string to float: 'cliqz.com/tracking'")
# Reproducible example
import dask.dataframe as dd

df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column appears as float, but it contains a couple of dirty string values
search_df = df.query('count > 0').compute()  # this line raises the type-conversion error
# Edit with one possible solution, but is this optimal when using dask?
import dask.dataframe as dd
import pandas as pd

to_n = lambda x: pd.to_numeric(x, errors="coerce")
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()

1 Answer:

Answer 0: (score: 0)

I had a similar problem, which I solved using .where.

import numpy as np
import pandas
import dask.dataframe as ddf

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
# Replace NaNs with a sentinel value before casting to an unsigned int.
p.where(~p.isna(), 999).astype("u4")

Or perhaps replace the second line with:

p.where(p.str.isnumeric(), 999).astype("u4")

In my case, my DataFrame (or Series) was the result of other operations, so I could not apply this directly in read_csv.