I have a Spark DataFrame like this:
+------+--------+--------------+--------------------+
|   dbn|    boro|total_students|                sBus|
+------+--------+--------------+--------------------+
|17K548|Brooklyn|           399|[B41, B43, B44-SB...|
|09X543|   Bronx|           378|[Bx13, Bx15, Bx17...|
|09X327|   Bronx|           543|[Bx1, Bx11, Bx13,...|
+------+--------+--------------+--------------------+
How can I flatten it so that each row is duplicated once per element of sBus, with sBus becoming a plain string column?
So the result would look like this:
+------+--------+--------------+--------------------+
|   dbn|    boro|total_students|                sBus|
+------+--------+--------------+--------------------+
|17K548|Brooklyn|           399|                 B41|
|17K548|Brooklyn|           399|                 B43|
|17K548|Brooklyn|           399|              B44-SB|
+------+--------+--------------+--------------------+
And so on...
Answer 0 (score: 2)
I can't think of a way to do this without converting it to an RDD.
# convert df to rdd
rdd = df.rdd

def extract(row, key):
    """Takes a Row and a key, returns a tuple of (dict without key, row[key])."""
    _dict = row.asDict()
    _list = _dict[key]
    del _dict[key]
    return (_dict, _list)

def add_to_dict(_dict, key, value):
    """Puts value back into the dict under key and returns the dict."""
    _dict[key] = value
    return _dict

# preserve the rest of the columns in the key, put the list to flatten in the value
rdd = rdd.map(lambda x: extract(x, 'sBus'))
# make a row for each item in the value
rdd = rdd.flatMapValues(lambda x: x)
# add the flattened value back into the dictionary
rdd = rdd.map(lambda x: add_to_dict(x[0], 'sBus', x[1]))
# convert back to a DataFrame
df = sqlContext.createDataFrame(rdd)
df.show()
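The tricky part is keeping the other columns together with the newly flattened values. I do this by mapping each row to a tuple of (dict of other columns, list to flatten) and then calling flatMapValues. This splits each element of the value list into a separate row while keeping the key attached, i.e.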
(key, ['A', 'B', 'C'])
becomes
(key, 'A')
(key, 'B')
(key, 'C')
Then I move the flattened value back into the dictionary of the other columns and convert the result back to a DataFrame.
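To make the three steps easier to follow without a Spark cluster, here is a minimal plain-Python sketch of the same idea. It is not Spark code: flat_map_values is a hypothetical stand-in for what RDD.flatMapValues(lambda x: x) does to (key, list) pairs, and the sample rows and bus names are made up for illustration.

```python
def extract(row, key):
    """Split a row dict into (other columns, list stored under key)."""
    rest = dict(row)           # copy so the input row is not mutated
    values = rest.pop(key)
    return rest, values


def flat_map_values(pairs):
    """Plain-Python stand-in for RDD.flatMapValues(lambda x: x)."""
    for key, values in pairs:
        for value in values:
            yield key, value   # one output pair per list element, key kept


def add_to_dict(rest, key, value):
    """Return a new dict with the flattened value put back under key."""
    merged = dict(rest)
    merged[key] = value
    return merged


# Hypothetical sample rows; the values are invented for this sketch.
rows = [
    {"dbn": "A1", "boro": "Brooklyn", "total_students": 10, "sBus": ["B1", "B2"]},
    {"dbn": "A2", "boro": "Bronx", "total_students": 20, "sBus": ["Bx1"]},
]

# Step 1: (other columns, list to flatten)
pairs = [extract(row, "sBus") for row in rows]
# Steps 2 and 3: one pair per list element, then merge the value back in
flattened = [add_to_dict(rest, "sBus", bus) for rest, bus in flat_map_values(pairs)]

for row in flattened:
    print(row)
```

Each input row yields as many output rows as it has entries in sBus, and every output row carries the same values for the other columns. The helpers here copy their dicts instead of mutating them, which is a design choice of this sketch and not something the Spark version above requires.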