Joining multiple pair RDDs in pyspark

Date: 2017-04-26 08:02:54

Tags: apache-spark pyspark spark-dataframe

I am trying to join 3 files and print the final result to the console using pyspark. I have converted them to pair RDDs, and I can join two of them without any problem. But for some reason I cannot join the third pair RDD to the already-joined RDD. Below are the structures of the 3 files.

EmployeeManager.csv

E01,John
E02,Kate
E03,Emily

EmployeeName.csv

E01,Brick
E02,Blunt
E03,Leo

EmployeeSalary.csv

E01,50000
E02,50000
E03,45000

Here is the pyspark code I have so far.

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf())

# Load each CSV as an RDD of lines
manager = sc.textFile('spark1/EmployeeManager.csv')
name = sc.textFile('spark1/EmployeeName.csv')
salary = sc.textFile('spark1/EmployeeSalary.csv')

# Split each line into a [key, value] pair, e.g. ['E01', 'John']
managerPairRDD = manager.map(lambda x: x.split(','))
namePairRDD = name.map(lambda x: x.split(','))
salaryPairRDD = salary.map(lambda x: x.split(','))

# The first join works fine
ns = namePairRDD.join(salaryPairRDD)
print('After name and salary join:\n%s' % ns.collect())

# This join never finishes
nsm = managerPairRDD.join(ns)
print('After joining 3 files: %s' % nsm.collect())

The program stalls at the last step. Here is the console output:

[cloudera@quickstart Spark]$ pyspark q7.py
WARNING: Running python applications through 'pyspark' is deprecated as of Spark 1.0.
Use ./bin/spark-submit <python file>
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
After name and salary join:                                                     
[(u'E02', (u'Blunt', u'50000')), (u'E03', (u'Leo', u'45000')), (u'E01', (u'Brick', u'50000'))]
[Stage 3:=======================================>                   (2 + 0) / 3]

Please let me know how I can fix this. Any help is greatly appreciated.

Thanks,

1 Answer:

Answer 0 (score: 0)

In the end I solved this by converting the input files to DataFrames.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext  # SQLContext lives in pyspark.sql, not pyspark

sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)

manager = sc.textFile('spark1/EmployeeManager.csv')
name = sc.textFile('spark1/EmployeeName.csv')
salary = sc.textFile('spark1/EmployeeSalary.csv')

# split(',') already returns a list, so each RDD converts
# straight to a two-column DataFrame
manager_df = manager.map(lambda x: x.split(',')).toDF(["col1", "col2"])
name_df = name.map(lambda x: x.split(',')).toDF(["col1", "col2"])
salary_df = salary.map(lambda x: x.split(',')).toDF(["col1", "col2"])

# Join all three on the employee id, then pick the columns explicitly;
# aliasing the three col2 columns avoids duplicate names in the output
nsm = name_df.join(salary_df, name_df.col1 == salary_df.col1) \
    .join(manager_df, name_df.col1 == manager_df.col1) \
    .select(name_df.col1, name_df.col2.alias('name'),
            salary_df.col2.alias('salary'), manager_df.col2.alias('manager'))

# DataFrames have no saveAsTextFile; go through the underlying RDD
nsm.rdd.saveAsTextFile('/spark1/q7sol')
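
For completeness, the RDD-only approach from the question is also logically sound; the second join simply returns nested tuples that need flattening. Below is a minimal sketch, assuming the same pair RDDs built in the question (the variable names and the flattening lambda are illustrative, not from the original post):

# join() on pair RDDs yields (key, (left, right)), so the second join
# nests the first result: (key, ((name, salary), manager))
ns = namePairRDD.join(salaryPairRDD)   # e.g. ('E01', ('Brick', '50000'))
nsm = ns.join(managerPairRDD)          # e.g. ('E01', (('Brick', '50000'), 'John'))

# Flatten the nested tuples into one flat record per employee
flat = nsm.map(lambda kv: (kv[0], kv[1][0][0], kv[1][0][1], kv[1][1]))
print(flat.collect())
# e.g. [('E01', 'Brick', '50000', 'John'), ...]

The nested (key, (left, right)) shape is the usual stumbling block when chaining joins on pair RDDs, which is part of why the DataFrame version above reads more cleanly.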