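I am trying to process a specific CSV file with cudf from RAPIDS. The file can be obtained from the following link: http://open-data-assurance-maladie.ameli.fr/depenses/download.php?Dir_Rep=Open_DAMIR&Annee=2018 I have tried the file A2018_01.csv (type "données", then press validate to download). As far as I understand, the cudf API is used much like pandas, so I first tried reading the CSV with pandas: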
import os
import pandas as pd
PATH='data/'
df = pd.read_csv(f'{PATH}A2018_01.csv', sep=";")
This takes about 2 minutes on my machine.
df.describe()
FLX_ANN_MOI ORG_CLE_REG AGE_BEN_SNDS BEN_RES_REG BEN_CMU_TOP BEN_QLT_COD BEN_SEX_COD DDP_SPE_COD ETE_CAT_SNDS ETE_REG_COD ... PSE_ACT_CAT PSE_SPE_SNDS PSE_STJ_SNDS PRE_INS_REG PSP_ACT_SNDS PSP_ACT_CAT PSP_SPE_SNDS PSP_STJ_SNDS TOP_PS5_TRG Unnamed: 55
count 34003028.0 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 ... 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 3.400303e+07 0.0
mean 201801.0 5.006560e+01 4.662571e+01 5.283103e+01 3.734231e+00 1.277093e+00 1.561921e+00 6.041314e+01 8.242066e+03 8.909063e+01 ... 3.681443e+00 6.550268e+00 2.915378e+00 7.256596e+01 1.230369e+00 9.164879e+00 1.814533e+01 3.557239e+00 4.214655e+00 NaN
std 0.0 3.207707e+01 2.420884e+01 2.963844e+01 4.360434e+00 6.316855e-01 4.961724e-01 6.001130e+01 3.497950e+03 2.353433e+01 ... 1.166599e+01 2.003136e+01 3.269808e+00 3.205104e+01 7.872794e+00 2.747816e+01 3.260440e+01 3.433198e+00 3.958894e+00 NaN
min 201801.0 5.000000e+00 0.000000e+00 5.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.101000e+03 5.000000e+00 ... 0.000000e+00 0.000000e+00 1.000000e+00 5.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 NaN
25% 201801.0 2.400000e+01 3.000000e+01 2.700000e+01 0.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 9.999000e+03 9.900000e+01 ... 1.000000e+00 0.000000e+00 1.000000e+00 4.400000e+01 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 NaN
50% 201801.0 4.400000e+01 5.000000e+01 5.200000e+01 0.000000e+00 1.000000e+00 2.000000e+00 4.300000e+01 9.999000e+03 9.900000e+01 ... 2.000000e+00 0.000000e+00 1.000000e+00 9.300000e+01 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.000000e+00 NaN
75% 201801.0 7.600000e+01 7.000000e+01 7.600000e+01 9.000000e+00 1.000000e+00 2.000000e+00 1.210000e+02 9.999000e+03 9.900000e+01 ... 3.000000e+00 1.000000e+00 2.000000e+00 9.900000e+01 0.000000e+00 1.000000e+00 1.400000e+01 9.000000e+00 9.000000e+00 NaN
max 201801.0 9.900000e+01 9.900000e+01 9.900000e+01 9.000000e+00 9.000000e+00 2.000000e+00 1.210000e+02 9.999000e+03 9.900000e+01 ... 9.900000e+01 9.900000e+01 9.000000e+00 9.900000e+01 9.900000e+01 9.900000e+01 9.900000e+01 9.000000e+00 9.000000e+00 NaN
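For a sense of scale, here is a rough back-of-the-envelope estimate of how large the parsed table is in memory. Only the row count comes from the `describe()` output above; the column count (56, inferred from the trailing `Unnamed: 55` column) and the 8 bytes per value (pandas' default `int64`/`float64`) are my assumptions, and I have not checked the actual dtypes.

```python
# Rough size estimate for the parsed table. Only N_ROWS is taken from the
# describe() output; the other two constants are assumptions.
N_ROWS = 34_003_028      # "count" row of describe()
N_COLS = 56              # assumed: columns 0..55, given the trailing "Unnamed: 55"
BYTES_PER_VALUE = 8      # assumed: every column is int64 or float64


def estimated_gib(n_rows: int, n_cols: int, bytes_per_value: int) -> float:
    """Return the estimated size of a dense numeric table in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


size_gib = estimated_gib(N_ROWS, N_COLS, BYTES_PER_VALUE)
print(f"approx. {size_gib:.1f} GiB")
```

Under these assumptions the table comes out at roughly 14 GiB, which would be more than the 11 GB of memory on my GPU. I do not know whether that is related to the problem described here.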
Then I tried cudf:
import cudf; print('cuDF Version'+ cudf.__version__)
gdf = cudf.read_csv(f'{PATH}A2018_01.csv', sep=";")
cuDF Version0.8.0+0.g8fa7bd3.dirty
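It starts loading but never finishes; the Jupyter notebook cell just keeps showing the asterisk (*) next to it. Any ideas what I should do to make this work? By the way, I am using Ubuntu 18.04.2 LTS with a 'GeForce RTX 2080 Ti' GPU, which has worked fine so far, e.g. with PyTorch.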
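One thing I could try to narrow this down is to load only a small slice of the file first. The sketch below uses only the standard library to copy the header plus the first N data lines into a separate file. The helper name `write_csv_head`, the sample file name, and the row count of 100,000 are made up for illustration, and it assumes no field contains an embedded newline.

```python
from itertools import islice
from pathlib import Path


def write_csv_head(src: Path, dst: Path, n_rows: int) -> int:
    """Copy the header line plus the first n_rows data lines of src into dst.

    Files are opened in binary mode so the bytes are copied verbatim and no
    assumption is made about the text encoding.
    Returns the number of data lines actually written.
    """
    written = 0
    with src.open("rb") as fin, dst.open("wb") as fout:
        fout.write(fin.readline())          # header line
        for line in islice(fin, n_rows):    # stops early if the file is shorter
            fout.write(line)
            written += 1
    return written


if __name__ == "__main__":
    src = Path("data/A2018_01.csv")         # same location as PATH above
    if src.exists():
        n = write_csv_head(src, Path("data/A2018_01_sample.csv"), 100_000)
        print(f"wrote {n} data rows")
```

The sample could then be read with the same call as above, `cudf.read_csv(f'{PATH}A2018_01_sample.csv', sep=";")`, to see whether the hang depends on the size of the file.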