I have a local PostgreSQL database on my machine. Some of its columns hold array data (example below).
+--------------------+
|            _authors|
+--------------------+
|[u'Miller, Roger ...|
|[u'Noyes, H.Pierre']|
|[u'Berman, S.M.',...|
+--------------------+
only showing top 3 rows

root
 |-- _authors: string (nullable = true)
Spark reads the column as a string, but I need it as an Array / WrappedArray. How can I achieve this?
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = new SQLContext(sc)
val df_records = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/dbname")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.records")
  .option("user", "name")
  .option("password", "pwd")
  .load()
  .select("_authors")
df_records.printSchema()
I need to explode/flatten this array at a later stage of the pipeline.
Thanks,
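Answer 0: (score: 4)
I have two suggestions for this problem:
1) I am not sure whether it works for arrays, but it is worth a try: you can define a specific schema when reading the DataFrame from the source. Example: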
import org.apache.spark.sql.types._

// Declare _authors as an array of strings instead of a plain string.
val customSchema = StructType(Seq(
  StructField("_authors", DataTypes.createArrayType(StringType), true),
  StructField("int_column", IntegerType, true)
  // other columns...
))
val df_records = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/dbname")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.records")
  .option("user", "name")
  .option("password", "pwd")
  .schema(customSchema)
  .load()
df_records.select("_authors").show()
2) If that does not work, the only other option I can think of at the moment is to define a parsing UDF.
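For the parsing UDF mentioned in point 2, here is a minimal sketch rather than a definitive implementation. It assumes the strings look like the Python 2 list representations in the question (`[u'Noyes, H.Pierre']`). It pulls out each quoted literal with a regular expression instead of splitting on commas, because the names themselves contain commas. The object name `AuthorsParsing`, the helper names, and the output column names `authors_array` and `author` are made up for illustration. Escape sequences such as `\xe9` are left as they are and not decoded.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, udf}

object AuthorsParsing {
  // One Python-2-style string literal: optional `u` prefix, then a single- or
  // double-quoted body (\x22 is the double quote). Backslash escapes are
  // skipped over so they cannot end the literal early, but they are not decoded.
  private val Literal =
    """u?'((?:[^'\\]|\\.)*)'|u?\x22((?:[^\x22\\]|\\.)*)\x22""".r

  // Returns the contents of every quoted literal found in `raw`.
  // A null input yields an empty sequence.
  def parseAuthors(raw: String): Seq[String] =
    if (raw == null) Seq.empty[String]
    else Literal.findAllMatchIn(raw).map { m =>
      // Only one of the two groups takes part in a match, depending on the quote style.
      Option(m.group(1)).getOrElse(m.group(2))
    }.toList

  // Adds an array<string> column (hypothetical name: authors_array) parsed from _authors.
  def withAuthorsArray(df: DataFrame): DataFrame = {
    val parseAuthorsUdf = udf(parseAuthors _)
    df.withColumn("authors_array", parseAuthorsUdf(col("_authors")))
  }

  // Produces one row per author (hypothetical column name: author).
  def explodeAuthors(df: DataFrame): DataFrame =
    withAuthorsArray(df).withColumn("author", explode(col("authors_array")))
}
```

With the DataFrame from the question, `AuthorsParsing.explodeAuthors(df_records)` would give one row per author. Note that `explode` drops rows whose array is empty, so records with a null or unparseable `_authors` value disappear at that step.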