Question

I have the following data:

+----+----+
|user|item|
+----+----+
|   a|   1|
|   a|   2|
|   a|   3|
|   b|   1|
|   b|   5|
|   b|   4|
|   b|   7|
|   c|  10|
|   c|   2|
+----+----+

After some transformations, I would like to get data like this:

(a,(a,1))
(a,(a,2))
(a,(a,3))
(b,(b,1))
(b,(b,5))
(b,(b,4))
(b,(b,7))
(c,(c,10))
(c,(c,2))

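The results may be in separate RDDs; that is fine with me.

In Scala and Java this can be done with Datasets, using a combination of groupByKey and flatMapGroups, but unfortunately PySpark has neither Datasets nor flatMapGroups.

I tried some flatMap and flatMapValues transformations in PySpark, but I could not get the correct result.

How can I get the expected result using PySpark?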

Answer 1

Please take a look at the code below. I think you can find a solution with this snippet.

[root@sandbox work]# hadoop dfs -put sample.txt /user/

Contents of sample.txt:

a|1
a|2
a|3
b|1
b|5
b|4
b|7
c|10
c|2

[root@sandbox work]# pyspark

lines = sc.textFile("hdfs://sandbox/user/sample.txt")

def parse(line):
    # Split the line once on '|', then key the (user, item) pair by user.
    fields = line.split('|')
    return (fields[0], (fields[0], fields[1]))

parsed_lines = lines.map(parse)

parsed_lines.collect()

[(u'a', (u'a', u'1')), (u'a', (u'a', u'2')), (u'a', (u'a', u'3')), (u'b', (u'b', u'1')), (u'b', (u'b', u'5')), (u'b', (u'b', u'4')), (u'b', (u'b', u'7')), (u'c', (u'c', u'10')), (u'c', (u'c', u'2'))]
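
The question starts from a DataFrame rather than a text file, so the same keying step could probably be applied to the DataFrame's rows directly. The following is only a minimal sketch: `key_by_user` is a hypothetical helper name, and a `namedtuple` stands in for `pyspark.sql.Row` so that the example runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (an assumption for this sketch), so the
# example can run without a Spark session. Real Rows also allow
# attribute access such as row.user and row.item.
Row = namedtuple("Row", ["user", "item"])


def key_by_user(row):
    """Turn a (user, item) row into (user, (user, item))."""
    return (row.user, (row.user, row.item))


# A few rows taken from the table in the question.
rows = [Row("a", 1), Row("a", 2), Row("b", 1), Row("c", 10)]

# Plain-Python equivalent of mapping the helper over an RDD of rows.
keyed = [key_by_user(r) for r in rows]
print(keyed)

# With a real DataFrame `df` that has the columns user and item, the same
# helper could be applied as: df.rdd.map(key_by_user)
```

Because each input row yields exactly one output pair, a plain map is enough here; grouping by key would only be needed if the per-user rows had to be processed together.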
