How to select the same column of a Spark data frame multiple times in sparklyr?

Time: 2018-01-15 07:29:13

Tags: r select dplyr sparklyr

I have a Spark data frame sdf. I want to generate another table made up of columns from sdf, where the same column may appear more than once.

Here is the desired output:

> sdf %>% select(DC1_Y1,DC2_Y1,DC2_Y1)
# Source:   lazy query [?? x 3]
# Database: spark_connection
        DC1_Y1       DC2_Y1      DC2_Y1
         <dbl>        <dbl>       <dbl>
 1  0.004576808 -0.004568069 -0.004568069
 2  0.000000000  0.000000000  0.000000000
 3  0.015242054  0.026584149  0.026584149
 4  0.004344194  0.006570250  0.006570250
 5  0.009738776  0.009713972  0.009713972
 6  0.007298836  0.005504776  0.005504776
 7  0.002613870  0.000000000  0.000000000
 8  0.006483329  0.009653164  0.009653164
 9 -0.002290456 -0.002294758 -0.002294758
10  0.003802521  0.007625295  0.007625295
# ... with more rows

Instead, the following happens:

> sdf %>% select(DC1_Y1,DC2_Y1,DC2_Y1)
# Source:   lazy query [?? x 2] 
# Database: spark_connection
     DC1_Y1       DC2_Y1
      <dbl>        <dbl>
 1  0.004576808 -0.004568069
 2  0.000000000  0.000000000
 3  0.015242054  0.026584149
 4  0.004344194  0.006570250
 5  0.009738776  0.009713972
 6  0.007298836  0.005504776
 7  0.002613870  0.000000000
 8  0.006483329  0.009653164
 9 -0.002290456 -0.002294758
10  0.003802521  0.007625295
# ... with more rows

Any idea how to achieve the desired output?

Thanks,

Edit:

Minimal example:

set.seed(1)
sdf_copy_to(sc, data.frame(DC1_Y1= runif(10),DC2_Y1=runif(10)) , "Test") -> sdf.test
sdf.test %>% select(DC1_Y1,DC2_Y1,DC2_Y1)
# Source:   lazy query [?? x 2]
# Database: spark_connection
   DC1_Y1    DC2_Y1
    <dbl>     <dbl>
 1 0.26550866 0.2059746
 2 0.37212390 0.1765568
 3 0.57285336 0.6870228
 4 0.90820779 0.3841037
 5 0.20168193 0.7698414
 6 0.89838968 0.4976992
 7 0.94467527 0.7176185
 8 0.66079779 0.9919061
 9 0.62911404 0.3800352
10 0.06178627 0.7774452
# ... with more rows

where sc is a Spark connection.
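
(For completeness, a minimal connection setup might look like the sketch below; the local master is an assumption, not part of the original question.)

library(sparklyr)
library(dplyr)

# Assumed setup for reproducing the example: connect to a local Spark instance
sc <- spark_connect(master = "local")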

1 Answer:

Answer 0 (score: 0)

OK, there may be no perfect answer to this at the moment. Here is a workaround:

> spark_apply(sdf, function(x) x[, c("DC1_Y1", "DC2_Y1", "DC2_Y1")])
# Source:   table<sparklyr_tmp_106672656ef0> [?? x 3]
# Database: spark_connection
         ID       DC1_Y1       DC2_Y1
        <dbl>        <dbl>       <dbl> 
 1  0.004576808 -0.004568069 -0.004568069
 2  0.000000000  0.000000000  0.000000000
 3  0.015242054  0.026584149  0.026584149
 4  0.004344194  0.006570250  0.006570250
 5  0.009738776  0.009713972  0.009713972
 6  0.007298836  0.005504776  0.005504776
 7  0.002613870  0.000000000  0.000000000
 8  0.006483329  0.009653164  0.009653164
 9 -0.002290456 -0.002294758 -0.002294758
10  0.003802521  0.007625295  0.007625295
# ... with more rows

The column names are obviously wrong, but at least the contents are correct.
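
If the names matter, two untested sketches may help: spark_apply() accepts a columns argument for naming the output, or one can stay in dplyr and duplicate the column under a new name, since a Spark table cannot hold two columns with the same name. The name DC2_Y1_copy below is hypothetical.

# Sketch 1: override the output names via spark_apply()'s `columns` argument
spark_apply(
  sdf,
  function(x) x[, c("DC1_Y1", "DC2_Y1", "DC2_Y1")],
  columns = c("DC1_Y1", "DC2_Y1", "DC2_Y1_copy")
)

# Sketch 2: duplicate the column under a new name with mutate()
sdf %>%
  select(DC1_Y1, DC2_Y1) %>%
  mutate(DC2_Y1_copy = DC2_Y1)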

For those who want to select columns using variables, see the following question: How to pass variables to functions called in spark_apply()?
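
For illustration, a minimal sketch of that approach (assumed here, following spark_apply()'s documented f(df, context) signature) could pass the column vector through the context argument:

# Hypothetical: the columns to select travel to the workers via `context`
cols <- c("DC1_Y1", "DC2_Y1", "DC2_Y1")
spark_apply(sdf, function(df, ctx) df[, ctx], context = cols)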