fixed_width_column = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}
Answer (score: 1)
The fastest approach is to use `substring` (from `pyspark.sql.functions`) like this:
>>> df = sc.parallelize([('123asd122000',),('234dfg221000',)]).toDF(['fullstr'])
>>> df.show()
+------------+
| fullstr|
+------------+
|123asd122000|
|234dfg221000|
+------------+
>>> df.withColumn('id',substring('fullstr',1,3)).withColumn('name',substring('fullstr',4,3)).show()
+------------+---+----+
| fullstr| id|name|
+------------+---+----+
|123asd122000|123| asd|
|234dfg221000|234| dfg|
+------------+---+----+
>>> df.withColumn('id',substring('fullstr',1,3)) \
... .withColumn('name',substring('fullstr',4,3)) \
... .withColumn('age',substring('fullstr',7,2)) \
... .withColumn('salary',substring('fullstr',9,4)) \
... .show()
+------------+---+----+---+------+
| fullstr| id|name|age|salary|
+------------+---+----+---+------+
|123asd122000|123| asd| 12| 2000|
|234dfg221000|234| dfg| 22| 1000|
+------------+---+----+---+------+
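The `(start, length)` pairs in `fixed_width_column` use 1-based start positions, the same convention as `substring`. As a quick sanity check of the offsets, the same split can be done with plain Python slicing (`parse_fixed_width` is a hypothetical helper, not part of the answer's code):

```python
# Fixed-width spec from the question: column -> (1-based start, length).
fixed_width_column = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}

def parse_fixed_width(line, spec):
    """Split one fixed-width record into a dict of column -> value.

    A (start, length) pair with a 1-based start maps to the Python
    slice [start - 1 : start - 1 + length].
    """
    return {col: line[start - 1:start - 1 + length]
            for col, (start, length) in spec.items()}

print(parse_fixed_width("123asd122000", fixed_width_column))
# → {'id': '123', 'name': 'asd', 'age': '12', 'salary': '2000'}
```

The values match the `id`/`name`/`age`/`salary` columns produced by the chained `withColumn` calls above.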
You can also read the data from a file. For example, given a file file.txt with a header:
% cat file.txt
fullstr
123asd122000
234dfg221000
322sfg213400
124gse235900
read it with:
>>> spark.read.option("header","true").csv('file:///Users/bala/Desktop/file.txt').show()
+------------+
| fullstr|
+------------+
|123asd122000|
|234dfg221000|
|322sfg213400|
|124gse235900|
+------------+
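Once the file is loaded as a single `fullstr` column, the same `substring` logic applies. Rather than chaining one `withColumn` per field by hand, the expressions can be generated from the spec dict and passed to `DataFrame.selectExpr` (`substring_exprs` is a hypothetical helper written for illustration):

```python
# Fixed-width spec: column -> (1-based start, length), as in the question.
spec = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}

def substring_exprs(column, spec):
    """Build one SQL substring(...) expression per fixed-width column."""
    return [f"substring({column}, {start}, {length}) AS {name}"
            for name, (start, length) in spec.items()]

print(substring_exprs("fullstr", spec))
# → ['substring(fullstr, 1, 3) AS id', 'substring(fullstr, 4, 3) AS name',
#    'substring(fullstr, 7, 2) AS age', 'substring(fullstr, 9, 4) AS salary']
```

With the file-read DataFrame from above, `df.selectExpr("fullstr", *substring_exprs("fullstr", spec))` should produce the same columns as the chained `withColumn` calls.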