I have a pandas DataFrame in which a categorical column contains NaN values, for example:
g = pd.Series(["A", "B", "C", np.nan], dtype="category")
g
0 A
1 B
2 C
3 NaN
dtype: category
Categories (3, object): [A, B, C]
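In pandas, NaN is not a category, but categorical data may still contain NaN values. I want to pass this DataFrame to R from a Jupyter notebook using %%R. R correctly recognizes the categorical column as a factor, but the factor is malformed, presumably because of the NaN value: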
%%R -i g
str(g)
Factor w/ 3 levels "A","B","C": 1 2 3 0
- attr(*, "names")= chr [1:4] "0" "1" "2" "3"
print(g)
Error in as.character.factor(x) : malformed factor
Is there a way to make sure the factor is not malformed, for example by having an NA factor level created automatically?
R: 3.5.1, rpy2: 2.9.4, Python 3
Answer 0 (score: 0)
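At the time of writing, this is a bug in how rpy2 converts pandas categoricals. It has been fixed, and the fix will be included in rpy2 starting with version 2.9.5: https://bitbucket.org/rpy2/rpy2/issues/493/rpy2-conversion-of-categorical-data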
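To see where the stray 0 in the R factor probably comes from, it can help to look at the integer codes pandas stores for the categorical. The sketch below only inspects the pandas side. The "add 1 to every code" step is an assumption about what the affected conversion does, inferred from the `1 2 3 0` shown by `str(g)`, not a description of the rpy2 source.

```python
import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# pandas marks a missing value in a categorical with the code -1
codes = g.cat.codes.tolist()

# Assumption: a naive conversion shifts the 0-based codes to R's 1-based
# level indices without treating -1 specially, so the NaN ends up as 0,
# which is not a valid level index in an R factor.
r_style_codes = [c + 1 for c in codes]

print(codes)
print(r_style_codes)
```

Under that assumption, the missing value ends up as index 0 instead of NA, which matches the malformed factor reported in the question.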
One workaround is quite simple: do not use NaN in pandas categoricals.
g = pd.Series(["A", "B", "C", np.nan], dtype="category")
# Prepare alternative representation to pass it to R
g_r = g.replace(np.nan, 'Missing')
Once converted, it looks like this:
%%R -i g_r
str(g_r)
Factor w/ 4 levels "A","B","C","Missing": 1 2 3 4
- attr(*, "names")= chr [1:4] "0" "1" "2" "3"
Converting back to an R NA is then just a matter of dropping the added level:
%%R -i g_r
str(droplevels(g_r, exclude = "Missing"))
Factor w/ 3 levels "A","B","C": 1 2 3 NA
- attr(*, "names")= chr [1:4] "0" "1" "2" "3"
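On the pandas side, the same placeholder can also be built explicitly through the `.cat` accessor, which may be useful if `replace` behaves differently on categorical data in your pandas version. This is a minimal sketch rather than part of the original workaround. The names `g_alt` and `g_back` are made up for illustration, and `"Missing"` is the same arbitrary placeholder label used above.

```python
import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# Register the placeholder as a real category first, then fill the NaN with it
g_alt = g.cat.add_categories("Missing").fillna("Missing")

# Removing the placeholder category turns those entries back into NaN,
# mirroring what droplevels(..., exclude = "Missing") does on the R side
g_back = g_alt.cat.remove_categories("Missing")

print(g_alt.tolist())
print(g_back.isna().tolist())
```

Adding the category before filling matters: filling a categorical with a value that is not among its categories raises an error in pandas.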