Splitting a DataFrame column on a hyphen delimiter in PySpark

Time: 2019-05-10 23:11:20

Tags: pyspark

I can't figure out how to split a DataFrame column into two columns on a hyphen delimiter.

# `sc` is an existing SparkContext with an active SparkSession

rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds']])

rows_df = rows.toDF(["ID"])

rows_df.show()

+----------+
|        ID|
+----------+
| 14-banana|
| 12-cheese|
| 13-olives|
|11-almonds|
+----------+

So I want two columns: one for the numeric ID, and one string column for the food type.

1 Answer:

Answer 0 (score: 1)

You are looking for the split function. See the example below:

import pyspark.sql.functions as F

rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds']])

rows_df = rows.toDF(["ID"])
# Split each ID on the hyphen; getItem(i) selects the i-th piece.
# (Named split_col to avoid shadowing the built-in str.split.)
split_col = F.split(rows_df.ID, '-')

rows_df = rows_df.withColumn('number', split_col.getItem(0))
rows_df = rows_df.withColumn('fruit', split_col.getItem(1))


rows_df.show()

Output:

+----------+------+-------+
|        ID|number|  fruit|
+----------+------+-------+
| 14-banana|    14| banana|
| 12-cheese|    12| cheese|
| 13-olives|    13| olives|
|11-almonds|    11|almonds|
+----------+------+-------+