I have the following PySpark code to join two DataFrames. Everything looks straightforward, but all I get as output is the error below. I can't get past it; could you help me find the root cause here?
C.csv
100,2015-09-03,SG,7
200,2016-01-30,AT,9
300,2016-01-25,AU,8
400,2016-01-22,AU,7
U.csv
248,248,COUNTRY,SG,Singapore
66,66,COUNTRY,AT,Austria
65,65,COUNTRY,AU,Australia
Expected output:
100,Singapore
200,Austria
300,Australia
400,Australia
The PySpark code (test.py) is:
from pyspark import SparkConf, SparkContext
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("HYBRID - READ CSV to HIVE ")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
C_rdd = sc.textFile("./hybrid/C.csv").map(lambda line: line.split(","))
R_rdd = sc.textFile("./hybrid/U.csv").map(lambda line: line.encode("ascii", "ignore").split(","))
C_df = C_rdd.toDF(['C_No','Op_Dt','Try_Cd','Lb'])
R_df = R_rdd.toDF(['C_Id','P_Id','CC_Cd','C_Nm','C_Ds'])
New = C_df.join(R_df, C_df.Try_Cd == R_df.C_Nm).select(['C_No','C_Ds'])
New.show()
PySpark error ($ spark-submit test.py):
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 5 fields are required while 6 values are provided.
at org.apache.spark.sql.execution.EvaluatePython$.fromJava(python.scala:225)
at org.apache.spark.sql.SQLContext$$anonfun$11.apply(SQLContext.scala:933)
at org.apache.spark.sql.SQLContext$$anonfun$11.apply(SQLContext.scala:933)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
Can you help resolve this issue?
Answer 0 (score: 0)
Hopefully you are using Spark 2.x+. The exception already points at the root cause: at least one line in U.csv splits into 6 comma-separated values where the schema expects 5 (most likely a field whose text itself contains a comma), and a naive line.split(",") cannot handle that. You can confirm the offending line with the quick check below, then let Spark's CSV reader do the parsing against an explicit schema, as shown after it.
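To locate the offending row, you can count the tokens each raw line produces before any toDF call (a minimal diagnostic sketch, reusing the sc and the file path from your own script):

# Print every raw line of U.csv that does not split into exactly
# 5 comma-separated values - these are the rows breaking the schema.
bad_lines = sc.textFile("./hybrid/U.csv") \
              .filter(lambda line: len(line.split(",")) != 5)
for line in bad_lines.collect():
    print(line)

Once the bad row is confirmed, read the files with explicit schemas and join them: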
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("HYBRID - READ CSV to HIVE") \
    .getOrCreate()

# Explicit schemas for both files; with the default PERMISSIVE mode,
# malformed rows surface as nulls instead of crashing the job the way
# a manual split(",") does.
cSchema = StructType([StructField("C_No", IntegerType()),
                      StructField("Op_Dt", StringType()),
                      StructField("Try_Cd", StringType()),
                      StructField("Lb", IntegerType())])
uSchema = StructType([StructField("C_Id", IntegerType()),
                      StructField("P_Id", IntegerType()),
                      StructField("CC_Cd", StringType()),
                      StructField("C_Nm", StringType()),
                      StructField("C_Ds", StringType())])

# Let Spark's CSV reader handle field splitting and quoting.
c_df = spark.read.csv("./hybrid/C.csv", schema=cSchema)
u_df = spark.read.csv("./hybrid/U.csv", schema=uSchema)

New = c_df.join(u_df, c_df.Try_Cd == u_df.C_Nm).select(c_df.C_No, u_df.C_Ds)
New.show()
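With the sample files above, New.show() should print the pairs from your expected output (row order after a join is not guaranteed):

+----+---------+
|C_No|     C_Ds|
+----+---------+
| 100|Singapore|
| 200|  Austria|
| 300|Australia|
| 400|Australia|
+----+---------+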