Question

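I am trying to reshape my DataFrame from long to wide using the Spark DataFrame API. The dataset is a collection of questions and answers from a student questionnaire. It is a huge dataset, and Q (question) and A (answer) each range roughly from 1 to 50000. I would like to collect all possible Q*A pairs and use them to build columns. If a student answered 1 to question 1, we assign the value 1 to column 1_1; otherwise we give it a 0. The dataset has been de-duplicated on S_ID, Q, A.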

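In R, I can simply use dcast from the reshape2 library, but I don't know how to do it with Spark. I found a pivoting solution at the link below, but it requires a fixed number of distinct Q*A pairs. http://rajasoftware.net/index.php/database/91446/scala-apache-spark-pivot-dataframes-pivot-spark-dataframe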

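I also tried concatenating Q and A with a user-defined function and then applying crosstab. However, I got the error below from the console, even though so far I have only tested my code on a sample data file:

The maximum limit of 1e6 pairs have been collected, which may not be all of the pairs.  
Please try reducing the amount of distinct items in your columns.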

Raw data:

S_ID, Q, A
1, 1, 1
1, 2, 2
1, 3, 3
2, 1, 1
2, 2, 3
2, 3, 4
2, 4, 5

=> After the long-to-wide transformation:

S_ID, QA_1_1, QA_2_2, QA_3_3, QA_2_3, QA_3_4, QA_4_5
1, 1, 1, 1, 0, 0, 0
2, 1, 0, 0, 1, 1, 1

R code.  
library(dplyr); library(reshape2);  
df1 <- df %>% group_by(S_ID, Q, A) %>% filter(row_number()==1) %>% mutate(temp=1)  
df1 %>% dcast(S_ID ~ Q + A, value.var="temp", fill=0)  

Spark code.
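import org.apache.spark.sql.functions.udf
// Build a "QA_<Q>_<A>" key per row, then cross-tabulate it against S_ID
val fnConcatenate = udf((x: String, y: String) => {"QA_"+ x +"_" + y})
val df1 = df.distinct.withColumn("QA", fnConcatenate($"Q", $"A"))
val df2 = df1.stat.crosstab("S_ID", "QA")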

Any ideas would be appreciated.

Answer 1

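What you are trying to do here is faulty by design, for two reasons:

1. You replace a sparse dataset with a dense one. That is expensive in both memory and computation, and it is almost never a good idea when the dataset is large.
2. You limit your ability to process data locally. Simplifying a little, a Spark DataFrame is just a wrapper around RDD[Row]. The larger the row, the fewer rows fit on a single partition, so operations such as aggregations become much more expensive and require more network traffic.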

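Wide tables are quite useful when you have proper columnar storage, where you can implement things like efficient compression or aggregation. From a practical perspective, almost everything you can do with a wide table can be done with a long one using group / window functions.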
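To make that last point concrete, here is a minimal sketch that works directly on the long table. It counts how many students gave each (Q, A) answer, which is what the sum of the corresponding QA column would be in the wide table, and uses a window function to turn that count into a share per question. The object name, the helper answerShares, the output columns n_students and share, and the local SparkSession (Spark 2.0 or later) are assumptions made for illustration; the sample rows are the ones from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, countDistinct, sum}

object LongFormatSketch {
  // Hypothetical helper: per (Q, A) pair, the number of students who gave
  // that answer and the pair's share among all answers to the same question.
  def answerShares(df: DataFrame): DataFrame = {
    val perPair = df
      .groupBy("Q", "A")
      .agg(countDistinct("S_ID").as("n_students"))

    // Window over all rows that belong to the same question
    val byQuestion = Window.partitionBy("Q")

    perPair.withColumn(
      "share",
      col("n_students") / sum(col("n_students")).over(byQuestion))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("long-format-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question, kept in long format
    val df = Seq(
      (1, 1, 1), (1, 2, 2), (1, 3, 3),
      (2, 1, 1), (2, 2, 3), (2, 3, 4), (2, 4, 5)
    ).toDF("S_ID", "Q", "A")

    answerShares(df).orderBy("Q", "A").show()
    spark.stop()
  }
}
```

The result has one row per distinct (Q, A) pair that actually occurs, so its size follows the data rather than the number of possible pairs.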

One thing you can try is to use sparse vectors to create a wide-like format:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.max
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.ml.feature.StringIndexer
import sqlContext.implicits._

// Combine Q and A into a single tab-separated key column
df.registerTempTable("df")
val dfComb = sqlContext.sql("SELECT s_id, CONCAT(Q, '\t', A) AS qa FROM df")

// Map every distinct qa key to a numeric index
val indexer = new StringIndexer()
  .setInputCol("qa")
  .setOutputCol("idx")
  .fit(dfComb)

val indexed = indexer.transform(dfComb)

// Vector size: largest index + 1
val n = indexed.agg(max("idx")).first.getDouble(0).toInt + 1

// One sparse vector per student, with 1.0 at the index of each answered pair
val wideLikeDF = indexed
  .select($"s_id", $"idx")
  .rdd
  .map{case Row(s_id: String, idx: Double) => (s_id, idx.toInt)}
  .groupByKey // This assumes no duplicates
  .mapValues(vals => Vectors.sparse(n, vals.map((_, 1.0)).toArray))
  .toDF("id", "qaVec")

The cool part here is that you can easily convert it to an IndexedRowMatrix and, for example, compute the SVD:

val mat = new IndexedRowMatrix(wideLikeDF.rdd.map{
  // Here we assume that s_id can be mapped directly to Long
  // If not it has to be indexed
  case Row(id: String, qaVec: SparseVector) => IndexedRow(id.toLong, qaVec)
})

val svd = mat.computeSVD(3)

or to a RowMatrix, and get column statistics or compute principal components:

val colStats = mat.toRowMatrix.computeColumnSummaryStatistics
val colSims = mat.toRowMatrix.columnSimilarities
val pc = mat.toRowMatrix.computePrincipalComponents(3)

Edit:

In Spark 1.6.0+ you can use the pivot function.
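A minimal sketch of what that could look like on the sample data from the question, assuming a local SparkSession (Spark 2.0 or later; on 1.6 the same groupBy / pivot calls are reachable through sqlContext). The object name and the helper toWide are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, count, lit}

object PivotSketch {
  // Hypothetical helper: one row per S_ID, one 0/1 column per QA_<Q>_<A> key.
  def toWide(df: DataFrame): DataFrame =
    df
      // Build the "QA_<Q>_<A>" key that becomes the column name
      .withColumn("QA",
        concat_ws("_", lit("QA"), col("Q").cast("string"), col("A").cast("string")))
      .groupBy("S_ID")
      .pivot("QA")
      // Input is de-duplicated on (S_ID, Q, A), so each count is 1 or missing
      .agg(count(lit(1)))
      // Pairs a student never gave come back as null; turn them into 0
      .na.fill(0)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, 1, 1), (1, 2, 2), (1, 3, 3),
      (2, 1, 1), (2, 2, 3), (2, 3, 4), (2, 4, 5)
    ).toDF("S_ID", "Q", "A")

    toWide(df).orderBy("S_ID").show()
    spark.stop()
  }
}
```

Note that pivot has to collect the distinct QA values first, and by default it refuses to go beyond spark.sql.pivotMaxValues (10000) of them. With tens of thousands of distinct Q*A pairs that setting would have to be raised, and the concerns about dense wide rows described above still apply.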
