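I have two questions:
How do I convert multiple categorical variables into one large matrix of dummy variables in Spark?
How do I use the one-hot encoder to get the right output and run a (logistic) regression on it?
I'm confused about how to use ft_string_indexer and ft_one_hot_encoder to get a proper tbl.
As an example, I created the following data frame: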
library(sparklyr)
library(tidyverse)
# Assumption: `configs` was not defined in the original post; the default config stands in for it.
configs <- spark_config()
sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"),
                    app_name = "sparklyr", version = "2.1.2", hadoop_version = "2.6",
                    config = configs)
df <- data.frame(
  a = rep(letters[1:4], 5),
  b = rep(c("one", "two"), 10),
  y = rbinom(n = 20, size = 1, prob = 0.5))
copy_to(sc, df, "df")
So df currently looks like this:
# Source:   table<df> [?? x 3]
# Database: spark_connection
   a     b         y
   <chr> <chr> <int>
 1 a     one       0
 2 b     two       1
 3 c     one       1
 4 d     two       0
 5 a     one       1
 6 b     two       0
 7 c     one       0
 8 d     two       1
 9 a     one       0
10 b     two       1
# ... with more rows
I run the following sequence of mutations and get the output shown after it:
df2 <- tbl(sc, "df")
df2 %>%
  sdf_mutate(a_idx = ft_string_indexer(a)) %>%
  sdf_mutate(b_idx = ft_string_indexer(b)) %>%
  sdf_mutate(a_vec = ft_one_hot_encoder(a_idx)) %>%
  sdf_mutate(b_vec = ft_one_hot_encoder(b_idx)) %>%
  collect()
# A tibble: 20 x 7
   a     b         y a_idx b_idx a_vec     b_vec
   <chr> <chr> <int> <dbl> <dbl> <list>    <list>
 1 a     one       0     0     0 <dbl [3]> <dbl [1]>
 2 b     two       1     1     1 <dbl [3]> <dbl [1]>
 3 c     one       1     2     0 <dbl [3]> <dbl [1]>
 4 d     two       0     3     1 <dbl [3]> <dbl [1]>
 5 a     one       1     0     0 <dbl [3]> <dbl [1]>
 6 b     two       0     1     1 <dbl [3]> <dbl [1]>
 7 c     one       0     2     0 <dbl [3]> <dbl [1]>
 8 d     two       1     3     1 <dbl [3]> <dbl [1]>
 9 a     one       0     0     0 <dbl [3]> <dbl [1]>
10 b     two       1     1     1 <dbl [3]> <dbl [1]>
11 c     one       1     2     0 <dbl [3]> <dbl [1]>
12 d     two       0     3     1 <dbl [3]> <dbl [1]>
13 a     one       1     0     0 <dbl [3]> <dbl [1]>
14 b     two       0     1     1 <dbl [3]> <dbl [1]>
15 c     one       0     2     0 <dbl [3]> <dbl [1]>
16 d     two       0     3     1 <dbl [3]> <dbl [1]>
17 a     one       0     0     0 <dbl [3]> <dbl [1]>
18 b     two       1     1     1 <dbl [3]> <dbl [1]>
19 c     one       0     2     0 <dbl [3]> <dbl [1]>
20 d     two       0     3     1 <dbl [3]> <dbl [1]>
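I can't seem to use this output correctly in the ml_logistic_regression function. Any help on how to streamline the encoding of multiple columns, get them into the right format, and run a regression on the result would be appreciated!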
Answer 0 (score: 2)
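The logistic regression classifier expects a single column as input, so you need to build that column from the encoded a_vec and b_vec columns. To do that, you can use a vector assembler, like this: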
features_idx = ft_vector_assembler(c("a_vec", "b_vec"))
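For reference, here is a minimal sketch of the whole pipeline written against the newer sparklyr feature-transformer interface, where each ft_* function takes the table as its first argument instead of being wrapped in sdf_mutate. Several things here are assumptions rather than part of the original post: a local Spark master, the output column name "features", and a sparklyr version (0.7 or later) in which ft_string_indexer and ft_vector_assembler accept these argument names. The column arguments of ft_one_hot_encoder are passed positionally because their names have differed between sparklyr releases.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; replace with your own master and config.
sc <- spark_connect(master = "local")

df <- data.frame(
  a = rep(letters[1:4], 5),
  b = rep(c("one", "two"), 10),
  y = rbinom(n = 20, size = 1, prob = 0.5),
  stringsAsFactors = FALSE
)
df_tbl <- copy_to(sc, df, "df", overwrite = TRUE)

encoded <- df_tbl %>%
  # Map each string category to a numeric index.
  ft_string_indexer(input_col = "a", output_col = "a_idx") %>%
  ft_string_indexer(input_col = "b", output_col = "b_idx") %>%
  # One-hot encode the indices (input and output columns given positionally).
  ft_one_hot_encoder("a_idx", "a_vec") %>%
  ft_one_hot_encoder("b_idx", "b_vec") %>%
  # Combine the encoded vectors into the single column the classifier expects.
  ft_vector_assembler(input_cols = c("a_vec", "b_vec"), output_col = "features")

# Fit the logistic regression on the assembled column, with y as the label.
fit <- ml_logistic_regression(encoded, features_col = "features", label_col = "y")

spark_disconnect(sc)
```

If the explicit encoding steps are not needed, ml_logistic_regression also accepts a formula, for example ml_logistic_regression(df_tbl, y ~ a + b), in which case sparklyr should take care of encoding the string columns itself.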