From Postgres DB

Time: 2016-05-05 22:01:11

Tags: apache-spark

I have a local PSQL database on my machine. Some columns contain their data in arrays (example below).

+--------------------+
|            _authors|
+--------------------+
|[u'Miller, Roger ...|
|[u'Noyes, H.Pierre']|
|[u'Berman, S.M.',...|
+--------------------+
only showing top 3 rows

root
 |-- _authors: string (nullable = true)

I need to read them as an Array / WrappedArray. How can I achieve this?

val sqlContext: SQLContext = new SQLContext(sc)
val df_records = sqlContext.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/dbname")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.records")
  .option("user", "name")
  .option("password", "pwd").load().select("_authors")
df_records.printSchema()

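I need to explode/flatten this array at a later stage of the pipeline.

Thanks,

1 answer:

Answer 0: (score: 4)

I have two suggestions:

1) I'm not sure whether it works for arrays, but it's worth a try: you can define a specific schema when reading a DataFrame from a source. Example: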

import org.apache.spark.sql.types._

// Declare _authors as an array of strings instead of a plain string.
val customSchema = StructType(Seq(
  StructField("_authors", DataTypes.createArrayType(StringType), true),
  StructField("int_column", IntegerType, true)
  // other columns...
))

val df_records = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/dbname")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.records")
  .option("user", "name")
  .option("password", "pwd")
  .schema(customSchema)
  .load()

df_records.select("_authors").show()

2) If the other option does not work, the only thing I can think of for now is to define a parsing UDF.
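For the parsing-UDF route in suggestion 2, a minimal sketch of the parsing step could look like the following. It assumes the column really stores the textual form of a Python list, as the sample output suggests (for example `[u'Noyes, H.Pierre']`); the full stored values are not visible in the question, so treat that as an assumption. The names `AuthorsParser` and `parseAuthors` are made up for illustration. Because the names themselves contain commas, the sketch extracts the quoted literals with a regular expression rather than splitting on commas. It does not decode escape sequences such as `\xe9`.

```scala
import scala.util.matching.Regex

object AuthorsParser {
  // Matches one Python-style string literal with an optional u prefix,
  // e.g. u'Berman, S.M.' or u"O'Neil, J.".
  // Group 1 is the opening quote character, group 2 is the literal's body.
  private val Literal: Regex = """u?(['"])((?:\\.|(?!\1).)*)\1""".r

  // Returns the literals found in the raw column value, in order.
  // A null input (the column is nullable) yields an empty list.
  def parseAuthors(raw: String): Seq[String] =
    if (raw == null) Seq.empty
    else Literal.findAllMatchIn(raw).map(_.group(2)).toList
}

// With Spark on the classpath, the function could then be wrapped as a UDF
// and the resulting array column exploded later in the pipeline:
//   import org.apache.spark.sql.functions.{col, explode, udf}
//   val parseAuthorsUdf = udf(AuthorsParser.parseAuthors _)
//   val parsed = df_records.withColumn("authors", parseAuthorsUdf(col("_authors")))
//   val flat = parsed.select(explode(col("authors")).as("author"))
```

Keeping the parser as a plain function makes it easy to check against a few raw values from the table before wiring it into the DataFrame.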