sortByKey() doesn't seem to work on strings in PySpark

Time: 2017-05-31 09:08:23

Tags: python apache-spark pyspark transformation rdd

I saved the two lines of the poem "Mary had a little lamb" in a file named "TESTSortbykey.md" and ran the following command on it in PySpark:

testsortbykey=sc.textFile("file:///opt/hadoop/spark-1.6.0/TESTSortbykey.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))

When I run testsortbykey.collect(), I get this output:

[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]

Now that I have the pair RDD testsortbykey, I want to apply reduceByKey() and sortByKey() to it, but neither seems to work. The commands I used are:

 testsortbykey.sortByKey()
 testsortbykey.collect()
 testsortbykey.reduceByKey(lambda x,y: x+y )
 testsortbykey.collect()

The output I get in both cases is:

[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]

Clearly the values have not been merged, even though several keys occur more than once (e.g. 'Mary', 'was').

Can someone explain why? What should I do instead to get around this?

EDIT: This is what my console looks like; hopefully this helps:

    >>> testsortbykey=sc.textFile("file:///opt/hadoop/spark-1.6.0/TESTSortbykey.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))
17/06/01 11:44:48 INFO storage.MemoryStore: Block broadcast_103 stored as values in memory (estimated size 228.9 KB, free 4.4 MB)
17/06/01 11:44:48 INFO storage.MemoryStore: Block broadcast_103_piece0 stored as bytes in memory (estimated size 19.5 KB, free 4.4 MB)
17/06/01 11:44:48 INFO storage.BlockManagerInfo: Added broadcast_103_piece0 in memory on localhost:57701 (size: 19.5 KB, free: 511.1 MB)
17/06/01 11:44:48 INFO spark.SparkContext: Created broadcast 103 from textFile at null:-1
>>> testsortbykey.sortByKey()
17/06/01 11:45:48 INFO mapred.FileInputFormat: Total input paths to process : 1
17/06/01 11:45:48 INFO spark.SparkContext: Starting job: sortByKey at <stdin>:1
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Got job 74 (sortByKey at <stdin>:1) with 2 output partitions
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 89 (sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting ResultStage 89 (PythonRDD[200] at sortByKey at <stdin>:1), which has no missing parents
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_104 stored as values in memory (estimated size 6.2 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_104_piece0 stored as bytes in memory (estimated size 3.9 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.BlockManagerInfo: Added broadcast_104_piece0 in memory on localhost:57701 (size: 3.9 KB, free: 511.1 MB)
17/06/01 11:45:48 INFO spark.SparkContext: Created broadcast 104 from broadcast at DAGScheduler.scala:1006
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 89 (PythonRDD[200] at sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Adding task set 89.0 with 2 tasks
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 89.0 (TID 183, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 89.0 (TID 184, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO executor.Executor: Running task 0.0 in stage 89.0 (TID 183)
17/06/01 11:45:48 INFO executor.Executor: Running task 1.0 in stage 89.0 (TID 184)
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 3, boot = 1, init = 1, finish = 1
17/06/01 11:45:48 INFO executor.Executor: Finished task 0.0 in stage 89.0 (TID 183). 2124 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 89.0 (TID 183) in 9 ms on localhost (1/2)
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 7, boot = 3, init = 4, finish = 0
17/06/01 11:45:48 INFO executor.Executor: Finished task 1.0 in stage 89.0 (TID 184). 2124 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 89.0 (TID 184) in 13 ms on localhost (2/2)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 89.0, whose tasks have all completed, from pool 
17/06/01 11:45:48 INFO scheduler.DAGScheduler: ResultStage 89 (sortByKey at <stdin>:1) finished in 0.013 s
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Job 74 finished: sortByKey at <stdin>:1, took 0.017325 s
17/06/01 11:45:48 INFO spark.SparkContext: Starting job: sortByKey at <stdin>:1
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Got job 75 (sortByKey at <stdin>:1) with 2 output partitions
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 90 (sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting ResultStage 90 (PythonRDD[201] at sortByKey at <stdin>:1), which has no missing parents
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_105 stored as values in memory (estimated size 6.0 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.MemoryStore: Block broadcast_105_piece0 stored as bytes in memory (estimated size 3.9 KB, free 4.4 MB)
17/06/01 11:45:48 INFO storage.BlockManagerInfo: Added broadcast_105_piece0 in memory on localhost:57701 (size: 3.9 KB, free: 511.1 MB)
17/06/01 11:45:48 INFO spark.SparkContext: Created broadcast 105 from broadcast at DAGScheduler.scala:1006
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 90 (PythonRDD[201] at sortByKey at <stdin>:1)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Adding task set 90.0 with 2 tasks
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 90.0 (TID 185, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 90.0 (TID 186, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:45:48 INFO executor.Executor: Running task 1.0 in stage 90.0 (TID 186)
17/06/01 11:45:48 INFO executor.Executor: Running task 0.0 in stage 90.0 (TID 185)
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:45:48 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 42, boot = -8, init = 49, finish = 1
17/06/01 11:45:48 INFO python.PythonRunner: Times: total = 41, boot = -6, init = 47, finish = 0
17/06/01 11:45:48 INFO executor.Executor: Finished task 0.0 in stage 90.0 (TID 185). 2382 bytes result sent to driver
17/06/01 11:45:48 INFO executor.Executor: Finished task 1.0 in stage 90.0 (TID 186). 2223 bytes result sent to driver
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 90.0 (TID 185) in 49 ms on localhost (1/2)
17/06/01 11:45:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 90.0 (TID 186) in 51 ms on localhost (2/2)
17/06/01 11:45:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 90.0, whose tasks have all completed, from pool 
17/06/01 11:45:48 INFO scheduler.DAGScheduler: ResultStage 90 (sortByKey at <stdin>:1) finished in 0.051 s
17/06/01 11:45:48 INFO scheduler.DAGScheduler: Job 75 finished: sortByKey at <stdin>:1, took 0.055618 s
PythonRDD[206] at RDD at PythonRDD.scala:43
>>> testsortbykey.collect()
17/06/01 11:46:04 INFO spark.SparkContext: Starting job: collect at <stdin>:1
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Got job 76 (collect at <stdin>:1) with 2 output partitions
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Final stage: ResultStage 91 (collect at <stdin>:1)
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Submitting ResultStage 91 (PythonRDD[207] at collect at <stdin>:1), which has no missing parents
17/06/01 11:46:04 INFO storage.MemoryStore: Block broadcast_106 stored as values in memory (estimated size 5.3 KB, free 4.4 MB)
17/06/01 11:46:04 INFO storage.MemoryStore: Block broadcast_106_piece0 stored as bytes in memory (estimated size 3.3 KB, free 4.4 MB)
17/06/01 11:46:04 INFO storage.BlockManagerInfo: Added broadcast_106_piece0 in memory on localhost:57701 (size: 3.3 KB, free: 511.1 MB)
17/06/01 11:46:04 INFO spark.SparkContext: Created broadcast 106 from broadcast at DAGScheduler.scala:1006
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 91 (PythonRDD[207] at collect at <stdin>:1)
17/06/01 11:46:04 INFO scheduler.TaskSchedulerImpl: Adding task set 91.0 with 2 tasks
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 91.0 (TID 187, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 91.0 (TID 188, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:46:04 INFO executor.Executor: Running task 0.0 in stage 91.0 (TID 187)
17/06/01 11:46:04 INFO executor.Executor: Running task 1.0 in stage 91.0 (TID 188)
17/06/01 11:46:04 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:46:04 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:46:04 INFO python.PythonRunner: Times: total = 41, boot = -16016, init = 16056, finish = 1
17/06/01 11:46:04 INFO python.PythonRunner: Times: total = 41, boot = -16017, init = 16057, finish = 1
17/06/01 11:46:04 INFO executor.Executor: Finished task 0.0 in stage 91.0 (TID 187). 2451 bytes result sent to driver
17/06/01 11:46:04 INFO executor.Executor: Finished task 1.0 in stage 91.0 (TID 188). 2252 bytes result sent to driver
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 91.0 (TID 187) in 48 ms on localhost (1/2)
17/06/01 11:46:04 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 91.0 (TID 188) in 49 ms on localhost (2/2)
17/06/01 11:46:04 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 91.0, whose tasks have all completed, from pool 
17/06/01 11:46:04 INFO scheduler.DAGScheduler: ResultStage 91 (collect at <stdin>:1) finished in 0.051 s
17/06/01 11:46:04 INFO scheduler.DAGScheduler: Job 76 finished: collect at <stdin>:1, took 0.055614 s
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]
>>> testsortbykey.reduceByKey(lambda x,y: x+y)
PythonRDD[212] at RDD at PythonRDD.scala:43
>>> testsortbykey.collect()
17/06/01 11:47:06 INFO spark.SparkContext: Starting job: collect at <stdin>:1
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Got job 77 (collect at <stdin>:1) with 2 output partitions
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 92 (collect at <stdin>:1)
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Missing parents: List()
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Submitting ResultStage 92 (PythonRDD[207] at collect at <stdin>:1), which has no missing parents
17/06/01 11:47:06 INFO storage.MemoryStore: Block broadcast_107 stored as values in memory (estimated size 5.3 KB, free 4.5 MB)
17/06/01 11:47:06 INFO storage.MemoryStore: Block broadcast_107_piece0 stored as bytes in memory (estimated size 3.3 KB, free 4.5 MB)
17/06/01 11:47:06 INFO storage.BlockManagerInfo: Added broadcast_107_piece0 in memory on localhost:57701 (size: 3.3 KB, free: 511.1 MB)
17/06/01 11:47:06 INFO spark.SparkContext: Created broadcast 107 from broadcast at DAGScheduler.scala:1006
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 92 (PythonRDD[207] at collect at <stdin>:1)
17/06/01 11:47:06 INFO scheduler.TaskSchedulerImpl: Adding task set 92.0 with 2 tasks
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 92.0 (TID 189, localhost, partition 0,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 92.0 (TID 190, localhost, partition 1,PROCESS_LOCAL, 2147 bytes)
17/06/01 11:47:06 INFO executor.Executor: Running task 0.0 in stage 92.0 (TID 189)
17/06/01 11:47:06 INFO executor.Executor: Running task 1.0 in stage 92.0 (TID 190)
17/06/01 11:47:06 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:55+55
17/06/01 11:47:06 INFO rdd.HadoopRDD: Input split: file:/opt/hadoop/spark-1.6.0/TESTSortbykey.md:0+55
17/06/01 11:47:06 INFO python.PythonRunner: Times: total = 3, boot = 2, init = 1, finish = 0
17/06/01 11:47:06 INFO executor.Executor: Finished task 1.0 in stage 92.0 (TID 190). 2252 bytes result sent to driver
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 92.0 (TID 190) in 13 ms on localhost (1/2)
17/06/01 11:47:06 INFO python.PythonRunner: Times: total = 11, boot = 3, init = 7, finish = 1
17/06/01 11:47:06 INFO executor.Executor: Finished task 0.0 in stage 92.0 (TID 189). 2451 bytes result sent to driver
17/06/01 11:47:06 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 92.0 (TID 189) in 16 ms on localhost (2/2)
17/06/01 11:47:06 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 92.0, whose tasks have all completed, from pool 
17/06/01 11:47:06 INFO scheduler.DAGScheduler: ResultStage 92 (collect at <stdin>:1) finished in 0.017 s
17/06/01 11:47:06 INFO scheduler.DAGScheduler: Job 77 finished: collect at <stdin>:1, took 0.020758 s
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), (u'whose', 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), (u'snow', 1), (u'and', 1), (u'every', 1), (u'where', 1), (u'that', 1), (u'Mary', 1), (u'went', 1), (u'the', 1), (u'Lamb', 1), (u'was', 1), (u'sure', 1), (u'to', 1), (u'go', 1), (u'', 1)]
>>> 

1 Answer:

Answer 0 (score: 0):

The first step is correct:

>>> rdd = sc.textFile("./yourFile.md").flatMap(lambda x: x.split(" ")).map(lambda x: (x,1))

>>> rdd.collect()
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), 
(u"It's", 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), 
(u'snow,', 1), (u'yeah', 1), (u'Everywhere', 1), (u'the', 1), (u'child', 1),
(u'went', 1), (u'The', 1), (u'lamb,', 1), (u'the', 1), (u'lamb', 1), 
(u'was', 1), (u'sure', 1), (u'to', 1), (u'go,', 1), (u'yeah', 1)]

So what is the problem?

If you run this:

>>> rdd.reduceByKey(lambda x,y: x+y)

and then this:

>>> rdd.collect()
[(u'Mary', 1), (u'had', 1), (u'a', 1), (u'little', 1), (u'lamb', 1), 
(u"It's", 1), (u'fleece', 1), (u'was', 1), (u'white', 1), (u'as', 1), 
(u'snow,', 1), (u'yeah', 1), (u'Everywhere', 1), (u'the', 1), (u'child', 1),
(u'went', 1), (u'The', 1), (u'lamb,', 1), (u'the', 1), (u'lamb', 1), 
(u'was', 1), (u'sure', 1), (u'to', 1), (u'go,', 1), (u'yeah', 1)]

you have only applied a transformation; the original rdd is unchanged, because RDD transformations return a new RDD rather than modifying the one they are called on.

But...

First option (if you only want to see the result of the transformation):

>>> rdd.reduceByKey(lambda x,y: x+y).collect()  
[(u'a', 1), (u'lamb', 2), (u'little', 1), (u'white', 1), (u'had', 1), 
(u'fleece', 1), (u'The', 1), (u'snow,', 1), (u'Everywhere', 1), (u'went', 1), (u'was', 2),
(u'the', 2), (u'as', 1), (u'go,', 1), (u'sure', 1), (u'lamb,', 1), 
(u"It's", 1), (u'yeah', 2), (u'to', 1), (u'child', 1), (u'Mary', 1)]

Second option (if you want to keep the transformed data in a new RDD):

If you run this:

>>> rddReduced = rdd.reduceByKey(lambda x,y: x+y)

and then this:

>>> rddReduced.collect()
[(u'a', 1), (u'lamb', 2), (u'little', 1), (u'white', 1), (u'had', 1), 
(u'fleece', 1), (u'The', 1), (u'snow,', 1), (u'Everywhere', 1), (u'went', 1), 
(u'was', 2), (u'the', 2), (u'as', 1), (u'go,', 1), (u'sure', 1), (u'lamb,', 1), 
(u"It's", 1), (u'yeah', 2), (u'to', 1), (u'child', 1), (u'Mary', 1)]

you have applied and saved the transformation, and the result is what you are looking for.

The same concept applies if you want to use sortByKey(): assign the result of the transformation to a new RDD (or chain it before collect()), as in the sketch below.
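
As a rough, minimal sketch (reusing the rdd built above; rddCounted and rddSorted are just placeholder names), you could assign each transformation to a new variable and only collect at the end:

>>> rddCounted = rdd.reduceByKey(lambda x, y: x + y)   # merge the counts for each word
>>> rddSorted = rddCounted.sortByKey()                 # sort the (word, count) pairs by key
>>> rddSorted.collect()                                # triggers the computation and returns the sorted pairs

You could equally chain everything in one expression, e.g. rdd.reduceByKey(lambda x, y: x + y).sortByKey().collect().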