I am trying to read multiple CSV files, each about 15 GB, with dask's read_csv. While doing so, dask infers a particular column as float, but the column contains some string values, so it fails later when I run an operation on it, complaining that it cannot convert a string to float. I therefore read all columns as strings with the dtype=str parameter. Now I want to convert that specific column to numeric with errors='coerce', so that the records containing strings become NaN and the rest become proper floats. Can you tell me how to achieve this with dask?
What I have already tried: type conversion
import dask.dataframe as dd

# col_names is a pandas Index/Series of column names defined elsewhere
df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)  # lazy cast; fails at compute time on the dirty string values
search_df = df.query('mycol > 0').compute()

This fails with:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-----------------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------------+--------+----------+
| mycol | object | float64 |
+-----------------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- mycol
ValueError("could not convert string to float: 'cliqz.com/tracking'")
# Reproducible example
import dask.dataframe as dd

df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column is inferred as float, but it contains a few dirty string values
search_df = df.query('count > 0').compute()  # this line raises the type-conversion error
# Edit, with one possible solution. But is this optimal when using dask?
import dask.dataframe as dd
import pandas as pd

# converters are applied to each parsed value, coercing non-numeric strings to NaN
to_n = lambda x: pd.to_numeric(x, errors="coerce")

df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()
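For comparison, here is a minimal sketch of another way to get the errors='coerce' behaviour: read the column as string and coerce it afterwards with map_partitions, which applies pd.to_numeric to each pandas partition (the file name mydata.csv and the column name count are taken from the example above; newer dask versions may also expose dask.dataframe.to_numeric for the same idea):

import dask.dataframe as dd
import pandas as pd

# read everything as strings so the parse itself cannot fail
df = dd.read_csv("mydata.csv", encoding='utf8', dtype=str)

# coerce per partition: dirty strings become NaN, the rest become floats;
# meta tells dask the resulting column name and dtype up front
df["count"] = df["count"].map_partitions(
    pd.to_numeric, errors="coerce", meta=("count", "float64")
)

search_df = df.query('count > 0').compute()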
Answer 0 (score: 0)
I had a similar problem, which I solved using .where:
import numpy as np
import pandas
import dask.dataframe as ddf

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
# replace missing values with 999, then cast to unsigned 32-bit int
p.where(~p.isna(), 999).astype("u4")
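Here ~p.isna() is the condition: entries where it is True keep their original value, and the remaining (missing) entries are replaced with 999 before the cast to unsigned 32-bit integers.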
Or perhaps replace the second line with:
p.where(p.str.isnumeric(), 999).astype("u4")  # keep only values that are numeric strings
In my case, my DataFrame (or Series) was the result of other operations, so I could not apply this directly in read_csv.
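As a rough illustration of that situation, a minimal sketch in which the Series already exists rather than coming straight out of read_csv (the values and the 999 sentinel are only illustrative):

import pandas
import dask.dataframe as ddf

# a Series that is the result of earlier operations, not of read_csv
s = ddf.from_pandas(pandas.Series(["1", "2", None, "4"]), 2)

# the same pattern applied after the fact
cleaned = s.where(~s.isna(), 999).astype("u4")
print(cleaned.compute())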