Question

I have a local PSQL database on my machine. Some of its columns hold their data as arrays (example below).

+--------------------+
|            _authors|
+--------------------+
|[u'Miller, Roger ...|
|[u'Noyes, H.Pierre']|
|[u'Berman, S.M.',...|
+--------------------+
only showing top 3 rows

root
 |-- _authors: string (nullable = true)

I need to read them as an Array / WrappedArray, but as the schema above shows, the column comes back as a string. How can I achieve this?

val sqlContext: SQLContext = new SQLContext(sc)
val df_records = sqlContext.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/dbname")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.records")
  .option("user", "name")
  .option("password", "pwd").load().select("_authors")
df_records.printSchema()

I need to explode/flatten this array at a later stage of the pipeline.

Thanks,

Answer 1

I have two suggestions:

1) I'm not sure whether it works for arrays, but it's worth a try: you can define a specific schema when reading a DataFrame from a source. Example:

import org.apache.spark.sql.types._

// Declare _authors as an array of strings instead of a plain string
val customSchema = StructType(Seq(
  StructField("_authors",  DataTypes.createArrayType(StringType), true),
  StructField("int_column", IntegerType, true)
  // other columns...
))

val df_records = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/dbname")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.records")
  .option("user", "name")
  .option("password", "pwd")
  .schema(customSchema)
  .load()

df_records.select("_authors").show()

2) If the first option doesn't work, the only alternative I can think of right now is to define a parsing UDF.
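For the second suggestion, a parsing UDF could look roughly like the sketch below. It assumes that each `_authors` value is a Python 2 style list repr such as `[u'Miller, Roger', u'Noyes, H.Pierre']`, as the sample output in the question suggests, and that Spark 2.x or later is available (hence `SparkSession` rather than `SQLContext`). The names `ParseAuthorsSketch`, `parseAuthors`, and `itemPattern` are made up for illustration, and the small in-memory DataFrame stands in for `df_records`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

object ParseAuthorsSketch {
  // Assumption: items are single-quoted with an optional u prefix.
  // Escaped quotes and double-quoted items (e.g. u"O'Brien, J.") are NOT handled here.
  val itemPattern = "u?'([^']*)'".r

  // Turns "[u'A, B', u'C']" into List("A, B", "C"); null becomes an empty list.
  def parseAuthors(raw: String): Seq[String] =
    if (raw == null) Seq.empty
    else itemPattern.findAllMatchIn(raw).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parse-authors-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for df_records.select("_authors") from the question
    val df = Seq(
      "[u'Miller, Roger', u'Noyes, H.Pierre']",
      "[u'Berman, S.M.']"
    ).toDF("_authors")

    val parseAuthorsUdf = udf(parseAuthors _)

    // Parse the string column into array<string>, then flatten to one row per author
    val parsed = df.withColumn("authors", parseAuthorsUdf(col("_authors")))
    parsed.printSchema()
    parsed.select(explode(col("authors")).as("author")).show(false)

    spark.stop()
  }
}
```

In the asker's pipeline, the same UDF would be applied to `df_records` instead of the stand-in DataFrame. If the stored strings can contain apostrophes or escapes, the regular expression would need to be replaced by a stricter parser.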

More details on StructType: org.apache.spark.sql.types.StructType
More examples of defining UDFs: this tutorial
