I'm trying to create a DataFrame from a text file in Spark, but it throws an error. Here is my code:
case class BusinessSchema(
  business_id: String, name: String, address: String, city: String, postal_code: String,
  latitude: String, longitude: String, phone_number: String, tax_code: String,
  business_certificate: String, application_date: String, owner_name: String,
  owner_address: String, owner_city: String, owner_state: String, owner_zip: String)
val businessDataFrame = sc.textFile(s"$baseDir/businesses_plus.txt").map(x => x.split("\t")).map {
  case Array(business_id, name, address, city, postal_code, latitude, longitude,
             phone_number, tax_code, business_certificate, application_date,
             owner_name, owner_address, owner_city, owner_state, owner_zip) =>
    BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude,
      phone_number, tax_code, business_certificate, application_date,
      owner_name, owner_address, owner_city, owner_state, owner_zip)
}
val businessRecords = businessDataFrame.toDF()
The error appears when I run this code:
businessRecords.take(20)
The error thrown:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23.0 (TID 25, localhost): scala.MatchError: [Ljava.lang.String;@6da1c3f1 (of class [Ljava.lang.String;)
Answer (score: 2)
A MatchError means that pattern matching failed: none of the cases matched some input. In this case, you have a single case that matches the result of split("\t") against an Array containing exactly 16 elements.
Your data probably contains records that don't follow this assumption (lines with fewer or more than 16 tab-separated fields), and any such record causes this exception.
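As a quick illustration (a standalone sketch with made-up input, not your data), the same failure can be reproduced with plain Scala collections:

val lines = Seq("a\tb\tc", "a\tb")   // second line has only 2 fields
lines.map(_.split("\t")).map {
  case Array(x, y, z) => (x, y, z)   // only 3-field rows are covered
}                                    // throws scala.MatchError on Array(a, b)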
To overcome this, either replace map with collect(f: PartialFunction[T, U]), which takes a PartialFunction (one that may silently ignore inputs that don't match any case) and thus simply filters out all the malformed records:

sc.textFile(s"$baseDir/businesses_plus.txt")
  .map(x => x.split("\t"))
  .collect {
    case Array(business_id, name, address, city, postal_code, latitude, longitude,
               phone_number, tax_code, business_certificate, application_date,
               owner_name, owner_address, owner_city, owner_state, owner_zip) =>
      BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude,
        phone_number, tax_code, business_certificate, application_date,
        owner_name, owner_address, owner_city, owner_state, owner_zip)
  }
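Note that this is the RDD.collect(f: PartialFunction[T, U]) overload, which returns a new RDD; it is not the zero-argument collect() that pulls all data to the driver. The dropping behavior is the same as on plain Scala collections, for example:

val rows = Seq("a\tb", "malformed")
val pairs = rows.map(_.split("\t")).collect { case Array(x, y) => (x, y) }
// pairs == List(("a", "b")) - the malformed row was silently skipped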
Or, add a case to catch the bad records and do something with them. For example, you can replace the RDD[BusinessSchema] result with an RDD[Either[BusinessSchema, Array[String]]] to reflect the fact that some records failed to parse, while still keeping the bad data around for logging or other handling:

val withErrors: RDD[Either[BusinessSchema, Array[String]]] = sc.textFile(s"$baseDir/businesses_plus.txt")
  .map(x => x.split("\t"))
  .map {
    case Array(business_id, name, address, city, postal_code, latitude, longitude,
               phone_number, tax_code, business_certificate, application_date,
               owner_name, owner_address, owner_city, owner_state, owner_zip) =>
      Left(BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude,
        phone_number, tax_code, business_certificate, application_date,
        owner_name, owner_address, owner_city, owner_state, owner_zip))
    case badArray => Right(badArray)
  }
// filter bad records, you can log / count / ignore them
val badRecords: RDD[Array[String]] = withErrors.collect { case Right(a) => a }
// filter good records - you can go on as planned from here...
val goodRecords: RDD[BusinessSchema] = withErrors.collect { case Left(r) => r }
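From here you can log or count the skipped lines and build the DataFrame from the good records only; a minimal sketch, assuming the same implicits that made toDF() available in your session (e.g. import spark.implicits._ in a spark-shell):

println(s"Skipped ${badRecords.count()} malformed lines")
val businessRecords = goodRecords.toDF()
businessRecords.take(20)  // no longer fails with scala.MatchError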