rpy2: converting categorical data containing null values to R factors

Time: 2018-11-15 04:02:06

Tags: r pandas rpy2 categorical-data factors

I have a pandas DataFrame with a categorical column that contains NaN values, for example:

import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")
g

0      A
1      B
2      C
3    NaN
dtype: category
Categories (3, object): [A, B, C]

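In pandas, NaN is not a category, but categorical data can still contain NaN values. I want to pass this DataFrame to R using %%R in a Jupyter notebook. R correctly recognizes the categorical column as a factor, but the factor is malformed, presumably because of the NaN value: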

%%R -i g
str(g)
Factor w/ 3 levels "A","B","C": 1 2 3 0
 - attr(*, "names")= chr [1:4] "0" "1" "2" "3" 

print(g)
Error in as.character.factor(x) : malformed factor
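
The trailing 0 in the codes printed by str(g) is consistent with how pandas stores a missing categorical value. Below is a minimal sketch of that, assuming the conversion simply shifts the 0-based pandas codes to R's 1-based factor codes (this is a guess at the cause, not confirmed rpy2 internals):

```python
import numpy as np
import pandas as pd

g = pd.Series(["A", "B", "C", np.nan], dtype="category")

# pandas stores each value as an integer index into the categories;
# a missing value gets the sentinel code -1, not a category of its own.
codes = g.cat.codes.tolist()
categories = g.cat.categories.tolist()
print(codes)       # [0, 1, 2, -1]
print(categories)  # ['A', 'B', 'C']

# Assumption: shifting to 1-based codes without special-casing -1
# turns the sentinel into 0, which is not a valid R factor code.
shifted = [c + 1 for c in codes]
print(shifted)     # [1, 2, 3, 0]
```

R factor codes index the levels starting at 1, so a code of 0 points at no level, which would explain the "malformed factor" error.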

Is there a way to make sure the factor is not malformed, for example by automatically creating an NA factor level?

R: 3.5.1, rpy2: 2.9.4, Python 3

1 answer:

Answer 0: (score: 0)

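At the time of writing, this is a bug in rpy2's conversion of pandas categoricals. It has been fixed, and the fix will be included in rpy2 starting with version 2.9.5: https://bitbucket.org/rpy2/rpy2/issues/493/rpy2-conversion-of-categorical-data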

A workaround is quite simple: do not use NaN in the pandas categorical:

g = pd.Series(["A", "B", "C", np.nan], dtype="category")
# Prepare alternative representation to pass it to R
g_r = g.replace(np.nan, 'Missing')

After conversion, it looks like this:

%%R -i g_r
str(g_r)

Factor w/ 4 levels "A","B","C","Missing": 1 2 3 4
- attr(*, "names")= chr [1:4] "0" "1" "2" "3"

Converting back to an R NA is just a matter of dropping the added level:

%%R -i g_r
str(droplevels(g_r, exclude = "Missing")) 

Factor w/ 3 levels "A","B","C": 1 2 3 NA
- attr(*, "names")= chr [1:4] "0" "1" "2" "3"
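
On the pandas side, the placeholder can also be added explicitly as a category before filling, which does not depend on how replace behaves for categorical data in a given pandas version. Below is a minimal sketch. The helper name fill_missing_category is hypothetical, and the "Missing" label is just the one used above:

```python
import numpy as np
import pandas as pd

def fill_missing_category(s, label="Missing"):
    """Return a copy of categorical Series `s` with NaN replaced by an explicit category.

    Hypothetical helper for illustration; not part of pandas or rpy2.
    """
    # fillna on a categorical only accepts values that are already categories,
    # so register the placeholder label first.
    if label not in s.cat.categories:
        s = s.cat.add_categories([label])
    return s.fillna(label)

g = pd.Series(["A", "B", "C", np.nan], dtype="category")
g_r = fill_missing_category(g)

print(g_r.cat.categories.tolist())  # ['A', 'B', 'C', 'Missing']
print(g_r.cat.codes.tolist())       # [0, 1, 2, 3]
```

Since no code is -1 any more, every value maps to a real level, and g_r can be passed to %%R -i g_r as shown above.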