Question

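I am connecting Apache Spark to PostgreSQL over JDBC and want to insert some data into my database. When I use append mode, I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?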

Answer 1

Scala:

If all you need is unique numbers, you can use zipWithUniqueId and recreate the DataFrame. First some imports and dummy data:

import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}

val df = sc.parallelize(Seq(
    ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")

Extract the schema for further use:

val schema = df.schema

Add the id field:

val rows = df.rdd.zipWithUniqueId.map{
   case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}

Create the DataFrame:

val dfWithPK = sqlContext.createDataFrame(
  rows, StructType(StructField("id", LongType, false) +: schema.fields))

The same thing in Python:

from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType

row = Row("foo", "bar")

df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()

# Build the indexed Row class only after df exists, since it reads df.columns
row_with_index = Row(*["id"] + df.columns)

def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)

df_with_pk = (df.rdd
    .zipWithUniqueId()
    .map(lambda x: f(*x))
    .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))

If you prefer consecutive numbers, you can replace zipWithUniqueId with zipWithIndex, but it is a little more expensive.
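
To make that trade-off concrete, here is a minimal pure-Python sketch of the two numbering schemes. It is modelled on the documented behaviour of the RDD methods: zipWithUniqueId gives items in the k-th partition the ids k, n+k, 2*n+k, ... where n is the number of partitions, while zipWithIndex numbers items consecutively and therefore has to know the size of every earlier partition first. The helper names and the list-of-lists stand-in for partitions are assumptions made for illustration; nothing here calls Spark.

```python
def unique_ids(partitions):
    """Mimic zipWithUniqueId: item i of partition k gets id i * n + k."""
    n = len(partitions)
    return [[(item, i * n + k) for i, item in enumerate(part)]
            for k, part in enumerate(partitions)]


def consecutive_ids(partitions):
    """Mimic zipWithIndex: consecutive ids, which need each partition's size."""
    # Extra pass over the data: compute the starting offset of each partition.
    offsets, total = [], 0
    for part in partitions:
        offsets.append(total)
        total += len(part)
    return [[(item, start + i) for i, item in enumerate(part)]
            for start, part in zip(offsets, partitions)]


# Hypothetical layout: four rows spread unevenly over two partitions.
parts = [["a", "b", "c"], ["d"]]
uniq = unique_ids(parts)
cons = consecutive_ids(parts)
print(uniq)  # [[('a', 0), ('b', 2), ('c', 4)], [('d', 1)]]
print(cons)  # [[('a', 0), ('b', 1), ('c', 2)], [('d', 3)]]
```

The first scheme needs no knowledge of other partitions but can leave gaps (no row gets id 3 here); the second is gap-free at the cost of the extra counting pass.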

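Directly with the DataFrame API:

(works in Scala, Python, Java and R with pretty much the same syntax)

Previously I had missed the monotonicallyIncreasingId function, which should work just fine as long as you don't need consecutive numbers: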

import org.apache.spark.sql.functions.monotonicallyIncreasingId

df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar|         id|
// +---+----+-----------+
// |  a|-1.0|17179869184|
// |  b|-2.0|42949672960|
// |  c|-3.0|60129542144|
// +---+----+-----------+

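While useful, monotonicallyIncreasingId is non-deterministic. Not only may the ids differ from one execution to the next, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.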
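
The large values in the output above follow from how the Spark documentation describes the current implementation: the partition ID is stored in the upper 31 bits and the record number within each partition in the lower 33 bits. A small pure-Python sketch decodes the three ids shown; the helper name split_id is hypothetical and the bit layout is taken from the documentation, not from running Spark here.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within the partition


def split_id(generated_id):
    """Split a generated id into (partition_id, record_number)."""
    partition_id = generated_id >> RECORD_BITS
    record_number = generated_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# The three ids from the output shown above.
ids = [17179869184, 42949672960, 60129542144]
decoded = [split_id(i) for i in ids]
print(decoded)  # [(2, 0), (5, 0), (7, 0)]
```

Each row is the first record of its partition (partitions 2, 5 and 7), so the ids depend on how the rows happen to be partitioned, which is one reason they can change between executions.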

Note:

It is also possible to use the rowNumber window function:

from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()

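Unfortunately:

WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

So unless you have a natural way to partition your data and still ensure uniqueness, it is not particularly useful at the moment.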
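
As a rough illustration of that last point, the pure-Python sketch below mimics a row number computed per partition key rather than over one global window. The sample rows and the helper name are hypothetical, and nothing here calls Spark; it only shows that a per-group number restarts at 1, so it identifies a row only in combination with the partition key.

```python
from itertools import groupby
from operator import itemgetter


def row_number_per_group(rows, key_index, order_index):
    """Number rows 1, 2, ... within each group, ordered by the order column."""
    ordered = sorted(rows, key=itemgetter(key_index, order_index))
    numbered = []
    for _, group in groupby(ordered, key=itemgetter(key_index)):
        for n, row in enumerate(group, start=1):
            numbered.append(row + (n,))
    return numbered


# Hypothetical rows: (group, value)
rows = [("x", -1.0), ("y", -2.0), ("x", -3.0)]
numbered = row_number_per_group(rows, key_index=0, order_index=1)
print(numbered)  # [('x', -3.0, 1), ('x', -1.0, 2), ('y', -2.0, 1)]
```

The number 1 appears twice, once per group, so only the pair (group, row number) is unique.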

Answer 2

from pyspark.sql.functions import monotonically_increasing_id

df.withColumn("id", monotonically_increasing_id()).show()

Note that the second argument of df.withColumn is monotonically_increasing_id(), not monotonically_increasing_id.

Answer 3

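I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those who want consecutive integers.

In this case we are using pyspark and relying on a dictionary comprehension to map the original row object to a new dictionary that fits a new schema including the unique index.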

from pyspark.sql.types import StructType, StructField, IntegerType

# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Need to zip together with a unique integer

# First create a new schema with the uuid field prepended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
                       + dfNoIndex.schema.fields)
# zip with the index, map it to a dictionary which includes new field
# (Python 3 has no tuple-unpacking lambdas and dict.items() cannot be
# concatenated with a list, so index the (row, id) pair explicitly)
df = dfNoIndex.rdd.zipWithIndex()\
                      .map(lambda pair: {k: v
                                         for k, v
                                         in list(pair[0].asDict().items()) + [("uuid", pair[1])]})\
                      .toDF(newSchema)

Primary keys with Apache Spark