How to create a dataframe from fixed_width_column (dictionary) - Pyspark

Time: 2021-07-06 20:55:53

Tags: python pyspark

fixed_width_column = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}


File ->
123asd122000
234dfg221000
322sfg213400
124gse235900


How can a file with the above structure be converted into a dataframe, inferring the schema from the dictionary?

1 answer:

Answer 0: (score: 1)

The quickest way is to use substring, like this:

>>> from pyspark.sql.functions import substring
>>> df = sc.parallelize([('123asd122000',),('234dfg221000',)]).toDF(['fullstr'])
>>> df.show()
+------------+
|     fullstr|
+------------+
|123asd122000|
|234dfg221000|
+------------+

>>> df.withColumn('id',substring('fullstr',1,3)).withColumn('name',substring('fullstr',4,3)).show()
+------------+---+----+
|     fullstr| id|name|
+------------+---+----+
|123asd122000|123| asd|
|234dfg221000|234| dfg|
+------------+---+----+

>>> df.withColumn('id',substring('fullstr',1,3)) \
... .withColumn('name',substring('fullstr',4,3)) \
... .withColumn('age',substring('fullstr',7,2)) \
... .withColumn('salary',substring('fullstr',9,4)) \
... .show()
+------------+---+----+---+------+
|     fullstr| id|name|age|salary|
+------------+---+----+---+------+
|123asd122000|123| asd| 12|  2000|
|234dfg221000|234| dfg| 22|  1000|
+------------+---+----+---+------+
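
To derive the columns from the dictionary itself instead of hard-coding each withColumn call, the same substring logic can be applied in a loop over the dictionary's items. A minimal sketch, assuming the (start, length) pairs are 1-based as in the question and reusing the df defined above:

>>> from pyspark.sql.functions import substring
>>> fixed_width_column = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}
>>> for col_name, (start, length) in fixed_width_column.items():
...     df = df.withColumn(col_name, substring('fullstr', start, length))
...
>>> df.show()
+------------+---+----+---+------+
|     fullstr| id|name|age|salary|
+------------+---+----+---+------+
|123asd122000|123| asd| 12|  2000|
|234dfg221000|234| dfg| 22|  1000|
+------------+---+----+---+------+

Since Python dictionaries preserve insertion order, the parsed columns come out in the same order as the dictionary keys.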

I can also read it from a file. For example, I have a file file.txt with a header:

% cat file.txt 
fullstr
123asd122000
234dfg221000
322sfg213400
124gse235900

Read it with:

>>> spark.read.option("header","true").csv('file:///Users/bala/Desktop/file.txt').show()
+------------+
|     fullstr|
+------------+
|123asd122000|
|234dfg221000|
|322sfg213400|
|124gse235900|
+------------+
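
For a raw fixed-width file that has no header line (like the one in the question), spark.read.text is an alternative: it loads each line into a single string column named value, which the dictionary-driven loop can then slice up. A sketch under that assumption; the path below is hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.getOrCreate()

# Hypothetical path: a fixed-width file without a header line.
# spark.read.text puts each line into a single string column named 'value'.
raw = spark.read.text('file:///Users/bala/Desktop/data.txt')

fixed_width_column = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}
for col_name, (start, length) in fixed_width_column.items():
    raw = raw.withColumn(col_name, substring('value', start, length))

raw.drop('value').show()

All extracted columns are strings; if typed columns are needed, a cast (for example .cast('int') on age and salary) can be added inside the same loop.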