I am trying to read multiple CSV files, each about 15 GB, with dask's read_csv. While doing so, dask infers a particular column as float, but the column contains some string values, so it fails later when I run an operation on it, complaining that it cannot convert a string to float. I therefore read all columns as strings with the dtype=str parameter. Now I want to convert that specific column to numeric with errors='coerce', so that the records containing strings become NaN and the rest become proper floats. Can you tell me how to achieve this with dask?
What I have already tried: type conversion
import dask.dataframe as dd

# col_names is a pandas Index/Series of column names defined elsewhere
df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)  # lazy cast; fails at compute time on the dirty string values
search_df = df.query('mycol > 0').compute()

This fails with:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-----------------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------------+--------+----------+
| mycol | object | float64 |
+-----------------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- mycol
ValueError("could not convert string to float: 'cliqz.com/tracking'")
# Reproducible example
import dask.dataframe as dd

df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column is inferred as float, but it contains a few dirty string values
search_df = df.query('count > 0').compute()  # this line raises the type-conversion error
# Edit, with one possible solution. But is this optimal when using dask?
import dask.dataframe as dd
import pandas as pd

# converters are applied to each parsed value, coercing non-numeric strings to NaN
to_n = lambda x: pd.to_numeric(x, errors="coerce")

df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()
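For comparison, here is a minimal sketch of another way to get the errors='coerce' behaviour: read the column as string and coerce it afterwards with map_partitions, which applies pd.to_numeric to each pandas partition (the file name mydata.csv and the column name count are taken from the example above; newer dask versions may also expose dask.dataframe.to_numeric for the same idea):

import dask.dataframe as dd
import pandas as pd

# read everything as strings so the parse itself cannot fail
df = dd.read_csv("mydata.csv", encoding='utf8', dtype=str)

# coerce per partition: dirty strings become NaN, the rest become floats;
# meta tells dask the resulting column name and dtype up front
df["count"] = df["count"].map_partitions(
    pd.to_numeric, errors="coerce", meta=("count", "float64")
)

search_df = df.query('count > 0').compute()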
Answer 0 (score: 0)
I had a similar problem, which I solved using .where:
import numpy as np
import pandas
import dask.dataframe as ddf

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
# replace missing values with 999, then cast to unsigned 32-bit int
p.where(~p.isna(), 999).astype("u4")
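Here ~p.isna() is the condition: entries where it is True keep their original value, and the remaining (missing) entries are replaced with 999 before the cast to unsigned 32-bit integers.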
Or perhaps replace the second line with:
p.where(p.str.isnumeric(), 999).astype("u4")  # keep only values that are numeric strings
In my case, my DataFrame (or Series) was the result of other operations, so I could not apply this directly in read_csv.
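As a rough illustration of that situation, a minimal sketch in which the Series already exists rather than coming straight out of read_csv (the values and the 999 sentinel are only illustrative):

import pandas
import dask.dataframe as ddf

# a Series that is the result of earlier operations, not of read_csv
s = ddf.from_pandas(pandas.Series(["1", "2", None, "4"]), 2)

# the same pattern applied after the fact
cleaned = s.where(~s.isna(), 999).astype("u4")
print(cleaned.compute())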