Creating a DataFrame from a text file in Spark

Date: 2016-08-11 19:47:14

Tags: scala apache-spark

I am trying to create a DataFrame from a text file in Spark, but it throws an error. Here is my code:

case class BusinessSchema(business_id: String, name: String, address: String, city: String,
  postal_code: String, latitude: String, longitude: String, phone_number: String,
  tax_code: String, business_certificate: String, application_date: String,
  owner_name: String, owner_address: String, owner_city: String, owner_state: String,
  owner_zip: String)

val businessDataFrame = sc.textFile(s"$baseDir/businesses_plus.txt").map(x => x.split("\t")).map {
  case Array(business_id, name, address, city, postal_code, latitude, longitude,
             phone_number, tax_code, business_certificate, application_date,
             owner_name, owner_address, owner_city, owner_state, owner_zip) =>
    BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude,
                   phone_number, tax_code, business_certificate, application_date,
                   owner_name, owner_address, owner_city, owner_state, owner_zip)
}

val businessRecords = businessDataFrame.toDF()

The error is thrown when I run this code:

businessRecords.take(20)

The error that is thrown:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23.0 (TID 25, localhost): scala.MatchError: [Ljava.lang.String;@6da1c3f1 (of class [Ljava.lang.String;)

1 Answer:

Answer 0 (score: 2):

A MatchError means that a pattern match failed - none of the cases matched a given input. Here, you have a single case that matches the result of split("\t") against an array of exactly 16 elements.

Your data probably contains some records that don't follow this assumption (with fewer or more than 16 tab-separated fields), and that is what causes this exception.
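
A minimal standalone sketch of the failure (the sample line and field names below are made up for illustration, not taken from the post): a match expression whose only case expects a fixed-size array throws scala.MatchError for any line with a different number of fields.

val line = "10\tSome Business\tSan Francisco"   // only 3 tab-separated fields
line.split("\t") match {
  case Array(id, name, city, postal_code) =>    // expects exactly 4 fields
    println(s"parsed $id")
}
// => scala.MatchError: [Ljava.lang.String;@... (of class [Ljava.lang.String;)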

To overcome this, you can either replace map with collect(f: PartialFunction[T, U]), which takes a PartialFunction (one that may silently ignore inputs that don't match any of its cases) and therefore simply filters out all the bad records:

sc.textFile(s"$baseDir/businesses_plus.txt")
  .map(x => x.split("\t"))
  .collect {
    case Array(business_id, name, address, city, postal_code, latitude, longitude,
               phone_number, tax_code, business_certificate, application_date,
               owner_name, owner_address, owner_city, owner_state, owner_zip) =>
      BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude,
                     phone_number, tax_code, business_certificate, application_date,
                     owner_name, owner_address, owner_city, owner_state, owner_zip)
  }
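
A short usage sketch (the val name filteredBusinesses and the length guard are assumptions added here, not part of the original answer; toDF() is assumed to work as in the question): binding the filtered RDD to a name lets the rest of the question's code run unchanged.

import org.apache.spark.rdd.RDD

// Hypothetical binding of the collect expression above; the size guard keeps
// only the 16-field lines, which is equivalent to the Array(...) pattern.
val filteredBusinesses: RDD[BusinessSchema] = sc.textFile(s"$baseDir/businesses_plus.txt")
  .map(x => x.split("\t"))
  .collect { case a if a.length == 16 =>
    BusinessSchema(a(0), a(1), a(2), a(3), a(4), a(5), a(6), a(7),
                   a(8), a(9), a(10), a(11), a(12), a(13), a(14), a(15))
  }

// Same final steps as in the question, now only over well-formed records:
val businessRecords = filteredBusinesses.toDF()
businessRecords.take(20)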

Alternatively, add a case to catch the bad records and do something with them - for example, you can replace the RDD[BusinessSchema] result with an RDD[Either[BusinessSchema, Array[String]]] to reflect the fact that some records failed to parse, while still keeping the faulty data around - for logging or other handling:

val withErrors: RDD[Either[BusinessSchema, Array[String]]] = sc.textFile(s"$baseDir/businesses_plus.txt")
  .map(x => x.split("\t"))
  .map {
    case Array(business_id, name, address, city, postal_code, latitude, longitude,
               phone_number, tax_code, business_certificate, application_date,
               owner_name, owner_address, owner_city, owner_state, owner_zip) =>
      Left(BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude,
                          phone_number, tax_code, business_certificate, application_date,
                          owner_name, owner_address, owner_city, owner_state, owner_zip))
    case badArray => Right(badArray)
  }

// filter bad records - you can log / count / ignore them
val badRecords: RDD[Array[String]] = withErrors.collect { case Right(a) => a }

// filter good records - you can go on as planned from here...
val goodRecords: RDD[BusinessSchema] = withErrors.collect { case Left(r) => r }
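
A brief follow-up sketch (the counting and printing below are illustrative additions, not part of the original answer): with the two RDDs separated, you can report how many lines failed to parse before moving on.

// count() is an action and triggers a Spark job - use it for a one-off report only.
val badCount = badRecords.count()
if (badCount > 0) {
  println(s"Skipped $badCount malformed lines")
  badRecords.take(5).foreach(fields => println(fields.mkString("|")))  // peek at a few offenders
}

// goodRecords can then go through the same toDF() conversion as in the question.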
