I am trying to join 3 files and print the final result to the console using PySpark. I have converted them to pair RDDs, and I can join two of them without any problem. But for some reason I cannot join the third pair RDD to the already-joined RDD. Below are the structures of the 3 files:
EmployeeManager.csv
E01,John
E02,Kate
E03,Emily
EmployeeName.csv
E01,Brick
E02,Blunt
E03,Leo
EmployeeSalary.csv
E01,50000
E02,50000
E03,45000
Here is the PySpark code I have so far:
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf())
manager = sc.textFile('spark1/EmployeeManager.csv')
name = sc.textFile('spark1/EmployeeName.csv')
salary = sc.textFile('spark1/EmployeeSalary.csv')
# split each line on ',' to get (employeeId, value) pair RDDs
managerPairRDD = manager.map(lambda x: x.split(','))
namePairRDD = name.map(lambda x: x.split(','))
salaryPairRDD = salary.map(lambda x: x.split(','))
# join name with salary on the employee id key -- this works
ns = namePairRDD.join(salaryPairRDD)
print 'After name and salary join: \n %s' % ns.collect()
# join the result with manager -- this is the step that never finishes
nsm = managerPairRDD.join(ns)
print 'After joining 3 files: %s' % nsm.collect()
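For reference, this is the nested structure I expect the second join to produce, hand-derived from the sample files above (the ordering of the pairs may vary):

[(u'E01', (u'John', (u'Brick', u'50000'))), (u'E02', (u'Kate', (u'Blunt', u'50000'))), (u'E03', (u'Emily', (u'Leo', u'45000')))]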
The program stops running at the last step. Here is the console output:
[cloudera@quickstart Spark]$ pyspark q7.py
WARNING: Running python applications through 'pyspark' is deprecated as of Spark 1.0.
Use ./bin/spark-submit <python file>
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
After name and salary join:
[(u'E02', (u'Blunt', u'50000')), (u'E03', (u'Leo', u'45000')), (u'E01', (u'Brick', u'50000'))]
[Stage 3:=======================================> (2 + 0) / 3]
Please let me know how to fix this. Any help is greatly appreciated.
Thanks,
Answer 0 (score: 0)
In the end, I solved this by converting the input files to DataFrames:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)
manager = sc.textFile('spark1/EmployeeManager.csv')
name = sc.textFile('spark1/EmployeeName.csv')
salary = sc.textFile('spark1/EmployeeSalary.csv')
# split() already returns a list, so each line becomes a two-column row
manager_df = manager.map(lambda x: x.split(',')).toDF(["col1", "col2"])
name_df = name.map(lambda x: x.split(',')).toDF(["col1", "col2"])
salary_df = salary.map(lambda x: x.split(',')).toDF(["col1", "col2"])
nsm = name_df.alias('name_df') \
    .join(salary_df.alias('salary_df'), name_df.col1 == salary_df.col1) \
    .join(manager_df.alias('manager_df'), name_df.col1 == manager_df.col1) \
    .select(name_df.col1, name_df.col2, salary_df.col2, manager_df.col2)
# DataFrames have no saveAsTextFile, so drop to the underlying RDD to write out
nsm.rdd.saveAsTextFile('/spark1/q7sol')
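For completeness, the same three-way join can also be expressed purely on pair RDDs by chaining join and then flattening the nested tuples. This is a minimal, untested sketch (in my case the RDD version hung, which I suspect was an environment issue rather than a logic one):

namePairRDD = name.map(lambda x: x.split(','))
salaryPairRDD = salary.map(lambda x: x.split(','))
managerPairRDD = manager.map(lambda x: x.split(','))
# name joined with salary gives (id, (name, salary));
# joining that with manager gives (id, ((name, salary), manager))
nsm_rdd = namePairRDD.join(salaryPairRDD).join(managerPairRDD)
# flatten each record to (id, name, salary, manager)
flat = nsm_rdd.map(lambda kv: (kv[0], kv[1][0][0], kv[1][0][1], kv[1][1]))
print 'Flattened join: %s' % flat.collect()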