fixed_width_column = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}
Answer (score: 1)
The fastest approach is to use `substring` (from `pyspark.sql.functions`) like this:
>>> df = sc.parallelize([('123asd122000',),('234dfg221000',)]).toDF(['fullstr'])
>>> df.show()
+------------+
| fullstr|
+------------+
|123asd122000|
|234dfg221000|
+------------+
>>> df.withColumn('id',substring('fullstr',1,3)).withColumn('name',substring('fullstr',4,3)).show()
+------------+---+----+
| fullstr| id|name|
+------------+---+----+
|123asd122000|123| asd|
|234dfg221000|234| dfg|
+------------+---+----+
>>> df.withColumn('id',substring('fullstr',1,3)) \
... .withColumn('name',substring('fullstr',4,3)) \
... .withColumn('age',substring('fullstr',7,2)) \
... .withColumn('salary',substring('fullstr',9,4)) \
... .show()
+------------+---+----+---+------+
| fullstr| id|name|age|salary|
+------------+---+----+---+------+
|123asd122000|123| asd| 12| 2000|
|234dfg221000|234| dfg| 22| 1000|
+------------+---+----+---+------+
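The `(start, length)` pairs in `fixed_width_column` use 1-based start positions, the same convention as `substring`. As a quick sanity check of the offsets, the same split can be done with plain Python slicing (`parse_fixed_width` is a hypothetical helper, not part of the answer's code):

```python
# Fixed-width spec from the question: column -> (1-based start, length).
fixed_width_column = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}

def parse_fixed_width(line, spec):
    """Split one fixed-width record into a dict of column -> value.

    A (start, length) pair with a 1-based start maps to the Python
    slice [start - 1 : start - 1 + length].
    """
    return {col: line[start - 1:start - 1 + length]
            for col, (start, length) in spec.items()}

print(parse_fixed_width("123asd122000", fixed_width_column))
# → {'id': '123', 'name': 'asd', 'age': '12', 'salary': '2000'}
```

The values match the `id`/`name`/`age`/`salary` columns produced by the chained `withColumn` calls above.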
You can also read the data from a file. For example, given a file file.txt with a header:
% cat file.txt
fullstr
123asd122000
234dfg221000
322sfg213400
124gse235900
read it with:
>>> spark.read.option("header","true").csv('file:///Users/bala/Desktop/file.txt').show()
+------------+
| fullstr|
+------------+
|123asd122000|
|234dfg221000|
|322sfg213400|
|124gse235900|
+------------+
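Once the file is loaded as a single `fullstr` column, the same `substring` logic applies. Rather than chaining one `withColumn` per field by hand, the expressions can be generated from the spec dict and passed to `DataFrame.selectExpr` (`substring_exprs` is a hypothetical helper written for illustration):

```python
# Fixed-width spec: column -> (1-based start, length), as in the question.
spec = {"id": (1, 3), "name": (4, 3), "age": (7, 2), "salary": (9, 4)}

def substring_exprs(column, spec):
    """Build one SQL substring(...) expression per fixed-width column."""
    return [f"substring({column}, {start}, {length}) AS {name}"
            for name, (start, length) in spec.items()]

print(substring_exprs("fullstr", spec))
# → ['substring(fullstr, 1, 3) AS id', 'substring(fullstr, 4, 3) AS name',
#    'substring(fullstr, 7, 2) AS age', 'substring(fullstr, 9, 4) AS salary']
```

With the file-read DataFrame from above, `df.selectExpr("fullstr", *substring_exprs("fullstr", spec))` should produce the same columns as the chained `withColumn` calls.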