PySpark dataframe from an RDD containing keys and values as lists of lists

Asked: 2017-03-31 06:59:00

Tags: apache-spark pyspark spark-dataframe rdd

I have an RDD like the one below, where each key's value is a list of lists containing some parameters.

(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]])
(32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]]])

I want to create a dataframe with rows and columns like this:

32719, u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0
32719, u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0
32897, u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0

Or just a dataframe of all the values, grouped by key. How can I do this?

2 Answers:

Answer 0 (Score: 3)

Use flatMapValues:

a = [(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]]),
     (32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]]])]

rdd = sc.parallelize(a)

# Flatten each value's inner list of rows, then prepend the key to each row
rdd.flatMapValues(lambda x: x[0]).map(lambda x: [x[0]] + x[1]).toDF().show()

Output:

+-------+----------------+---------------+----+----+-------+-----+----+
|  _1   |       _2       |      _3       | _4 | _5 |  _6   | _7  | _8 |
+-------+----------------+---------------+----+----+-------+-----+----+
| 32719 | 200.73.55.34   | 192.16.48.217 |  0 |  6 | 10163 | 443 |  0 |
| 32719 | 177.207.76.243 | 192.16.58.8   |  0 |  6 | 59575 |  80 |  0 |
| 32897 | 200.73.55.34   | 193.16.48.217 |  0 |  6 | 10163 | 443 |  0 |
| 32897 | 167.207.76.243 | 194.16.58.8   |  0 |  6 | 59575 |  80 |  0 |
+-------+----------------+---------------+----+----+-------+-----+----+

Answer 1 (Score: 0)

You can map to add the key to each value and then create the dataframe. I tried it:

>>> dat1 = [(32719, [[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]), (32897, [[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]])]

>>> rdd1 = sc.parallelize(dat1).map(lambda x: [[x[0]] + i for i in x[1]]).flatMap(lambda x: x)
>>> df = rdd1.toDF(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'])
>>> df.show()
+-----+--------------+-------------+----+----+-----+----+----+
| col1|          col2|         col3|col4|col5| col6|col7|col8|
+-----+--------------+-------------+----+----+-----+----+----+
|32719|  200.73.55.34|192.16.48.217|   0|   6|10163| 443|   0|
|32719|177.207.76.243|  192.16.58.8|   0|   6|59575|  80|   0|
|32897|  200.73.55.34|193.16.48.217|   0|   6|10163| 443|   0|
|32897|167.207.76.243|  194.16.58.8|   0|   6|59575|  80|   0|
+-----+--------------+-------------+----+----+-----+----+----+