I have a Spark dataframe `sdf`. I want to generate another table containing columns of `sdf`, where the same column may appear more than once. Below is the desired output.
> sdf %>% select(DC1_Y1,DC2_Y1,DC2_Y1)
# Source: lazy query [?? x 3]
# Database: spark_connection
DC1_Y1 DC2_Y1 DC2_Y1
<dbl> <dbl> <dbl>
1 0.004576808 -0.004568069 -0.004568069
2 0.000000000 0.000000000 0.000000000
3 0.015242054 0.026584149 0.026584149
4 0.004344194 0.006570250 0.006570250
5 0.009738776 0.009713972 0.009713972
6 0.007298836 0.005504776 0.005504776
7 0.002613870 0.000000000 0.000000000
8 0.006483329 0.009653164 0.009653164
9 -0.002290456 -0.002294758 -0.002294758
10 0.003802521 0.007625295 0.007625295
# ... with more rows
Instead, the following happens:
> sdf %>% select(DC1_Y1,DC2_Y1,DC2_Y1)
# Source: lazy query [?? x 2]
# Database: spark_connection
DC1_Y1 DC2_Y1
<dbl> <dbl>
1 0.004576808 -0.004568069
2 0.000000000 0.000000000
3 0.015242054 0.026584149
4 0.004344194 0.006570250
5 0.009738776 0.009713972
6 0.007298836 0.005504776
7 0.002613870 0.000000000
8 0.006483329 0.009653164
9 -0.002290456 -0.002294758
10 0.003802521 0.007625295
# ... with more rows
Any idea how to achieve the desired output?
Thanks.
Edit
Minimal example:
set.seed(1)
sdf_copy_to(sc, data.frame(DC1_Y1 = runif(10), DC2_Y1 = runif(10)), "Test") -> sdf.test
sdf.test %>% select(DC1_Y1,DC2_Y1,DC2_Y1)
# Source: lazy query [?? x 2]
# Database: spark_connection
DC1_Y1 DC2_Y1
<dbl> <dbl>
1 0.26550866 0.2059746
2 0.37212390 0.1765568
3 0.57285336 0.6870228
4 0.90820779 0.3841037
5 0.20168193 0.7698414
6 0.89838968 0.4976992
7 0.94467527 0.7176185
8 0.66079779 0.9919061
9 0.62911404 0.3800352
10 0.06178627 0.7774452
# ... with more rows
Here `sc` is a Spark connection.
Answer 0 (score: 0)
OK. There may be no perfect answer to this question at the moment. Here is a workaround:
> spark_apply(sdf,function(x) {x[,c("DC1_Y1","DC2_Y1","DC2_Y1")]})
# Source: table<sparklyr_tmp_106672656ef0> [?? x 3]
# Database: spark_connection
ID DC1_Y1 DC2_Y1
<dbl> <dbl> <dbl>
1 0.004576808 -0.004568069 -0.004568069
2 0.000000000 0.000000000 0.000000000
3 0.015242054 0.026584149 0.026584149
4 0.004344194 0.006570250 0.006570250
5 0.009738776 0.009713972 0.009713972
6 0.007298836 0.005504776 0.005504776
7 0.002613870 0.000000000 0.000000000
8 0.006483329 0.009653164 0.009653164
9 -0.002290456 -0.002294758 -0.002294758
10 0.003802521 0.007625295 0.007625295
# ... with more rows
The column names are obviously wrong, but at least the contents are correct.
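Another possible workaround (a sketch only, not from the original answer, and untested here since it needs a live Spark connection `sc`): Spark SQL cannot return two columns with the same name, so you can give the duplicate a distinct alias with `mutate()` and then select all three. The name `DC2_Y1_copy` below is an assumption; pick whatever alias suits you.

```r
library(sparklyr)
library(dplyr)

# Duplicate DC2_Y1 under a new, distinct name, then select the
# three columns. The lazy query stays entirely on the Spark side,
# unlike spark_apply(), which ships data through an R process.
sdf %>%
  mutate(DC2_Y1_copy = DC2_Y1) %>%          # DC2_Y1_copy: hypothetical alias
  select(DC1_Y1, DC2_Y1, DC2_Y1_copy)
```

The trade-off is that the third column cannot literally be named `DC2_Y1` again, but the values match and the header is at least meaningful, which the `spark_apply()` workaround above does not guarantee.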
For those who want to select columns using variables, see the following question: How to pass variables to functions called in spark_apply()?