How to correctly use ft_string_indexer and ft_one_hot_encoder for multiple columns in SparkR

Time: 2018-07-26 16:56:15

Tags: r apache-spark sparklyr

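I have two questions:

  1. How can I convert multiple categorical variables into one large matrix of dummy variables in Spark?

  2. How can I get the correct output from one_hot_encoder and run a (logistic) regression on it?

I am confused about how to use ft_string_indexer and ft_one_hot_encoder to get a correct tbl.

As an example, I created the following data frame: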

library(sparklyr)
library(tidyverse)

sc <- spark_connect(master="yarn-client", spark_home =Sys.getenv("SPARK_HOME"), app_name = "sparklyr",
                    version = "2.1.2", hadoop_version = "2.6", config = configs)

df <- data.frame(
  a=rep(letters[1:4],5), 
  b=rep(c("one", "two"), 10), 
  y=rbinom(n=20,size=1,prob=0.5))

copy_to(sc, df, "df")

So df currently looks like this:

# Source:   table<df> [?? x 3]
# Database: spark_connection
   a     b         y
   <chr> <chr> <int>
 1 a     one       0
 2 b     two       1
 3 c     one       1
 4 d     two       0
 5 a     one       1
 6 b     two       0
 7 c     one       0
 8 d     two       1
 9 a     one       0
10 b     two       1
# ... with more rows

I run the following sequence of mutations and get the output below:

df2 <- tbl(sc, "df")
df2 %>% 
    sdf_mutate(a_idx = ft_string_indexer(a)) %>% 
    sdf_mutate(b_idx = ft_string_indexer(b)) %>% 
    sdf_mutate(a_vec = ft_one_hot_encoder(a_idx)) %>% 
    sdf_mutate(b_vec = ft_one_hot_encoder(b_idx)) %>% 
    collect()

# A tibble: 20 x 7
   a     b         y a_idx b_idx a_vec     b_vec    
   <chr> <chr> <int> <dbl> <dbl> <list>    <list>   
 1 a     one       0     0     0 <dbl [3]> <dbl [1]>
 2 b     two       1     1     1 <dbl [3]> <dbl [1]>
 3 c     one       1     2     0 <dbl [3]> <dbl [1]>
 4 d     two       0     3     1 <dbl [3]> <dbl [1]>
 5 a     one       1     0     0 <dbl [3]> <dbl [1]>
 6 b     two       0     1     1 <dbl [3]> <dbl [1]>
 7 c     one       0     2     0 <dbl [3]> <dbl [1]>
 8 d     two       1     3     1 <dbl [3]> <dbl [1]>
 9 a     one       0     0     0 <dbl [3]> <dbl [1]>
10 b     two       1     1     1 <dbl [3]> <dbl [1]>
11 c     one       1     2     0 <dbl [3]> <dbl [1]>
12 d     two       0     3     1 <dbl [3]> <dbl [1]>
13 a     one       1     0     0 <dbl [3]> <dbl [1]>
14 b     two       0     1     1 <dbl [3]> <dbl [1]>
15 c     one       0     2     0 <dbl [3]> <dbl [1]>
16 d     two       0     3     1 <dbl [3]> <dbl [1]>
17 a     one       0     0     0 <dbl [3]> <dbl [1]>
18 b     two       1     1     1 <dbl [3]> <dbl [1]>
19 c     one       0     2     0 <dbl [3]> <dbl [1]>
20 d     two       0     3     1 <dbl [3]> <dbl [1]>

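This output does not seem to be usable as-is in the ml_logistic_regression function. Any help on how to streamline the encoding of multiple columns, get the format right, and run a regression on the result would be appreciated!

1 Answer:

Answer 0: (score: 2)

The logistic regression classifier expects a single column as input, so you need to combine the encoded a_vec and b_vec into that one column. To do this, you can use a vector assembler, as follows: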

features_idx = ft_vector_assembler(c("a_vec", "b_vec"))
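
As an alternative to assembling the columns by hand, the formula interface of ml_logistic_regression can do the indexing, one-hot encoding, and assembling for string columns itself. The following is only a minimal sketch under some assumptions: it connects to a local Spark installation (master = "local") instead of the YARN setup from the question, the table name df_tbl is made up for illustration, and it assumes a sparklyr version in which ml_predict is available.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; the question connects to YARN instead.
sc <- spark_connect(master = "local")

# Same toy data as in the question.
df <- data.frame(
  a = rep(letters[1:4], 5),
  b = rep(c("one", "two"), 10),
  y = rbinom(n = 20, size = 1, prob = 0.5),
  stringsAsFactors = FALSE
)
df_tbl <- copy_to(sc, df, "df", overwrite = TRUE)  # df_tbl is an illustrative name

# With a formula, the string columns a and b are indexed, one-hot encoded,
# and assembled into a single features column before the model is fitted.
fit <- ml_logistic_regression(df_tbl, y ~ a + b)

# Inspect the fitted coefficients (one per dummy variable, plus the intercept).
fit$coefficients

# Score the training data; the result is a Spark table with prediction columns.
ml_predict(fit, df_tbl)

spark_disconnect(sc)
```

The formula route keeps the encoding inside the model, so new data with the same string columns can be scored without repeating the indexer and encoder steps manually. The explicit vector assembler is still the way to go when the intermediate a_vec and b_vec columns are needed for something else.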