I am trying to read a large file in Spark with Scala and then perform a join. When I test with small files it works fine, but with larger files I get errors. I managed to isolate one file that produces the error. The file is 1 GB in size, and the error is thrown at the end, while creating the partitions, where I split the file to get the columns.
The error is thrown after these lines:
val rdd = sc.textFile(mainFileURL)
val header = rdd.filter(_.contains("uniqueFundamentalSet")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
println(schema)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("uniqueFundamentalSet")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
This is the culprit:
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("uniqueFundamentalSet")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
Please suggest how I can handle this.
When I do rdd.count I get a value, but when I do data.count() I get this error:
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 37
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, uniqueFundamentalSet), StringType), true) AS uniqueFundamentalSet#0
Here is my sample dataset:
uniqueFundamentalSet|^|PeriodId|^|SourceId|^|StatementTypeCode|^|StatementCurrencyId|^|FinancialStatementLineItem.lineItemId|^|FinancialAsReportedLineItemName|^|FinancialAsReportedLineItemName.languageId|^|FinancialStatementLineItemValue|^|AdjustedForCorporateActionValue|^|ReportedCurrencyId|^|IsAsReportedCurrencySetManually|^|Unit|^|IsTotal|^|StatementSectionCode|^|DimentionalLineItemId|^|IsDerived|^|EstimateMethodCode|^|EstimateMethodNote|^|EstimateMethodNote.languageId|^|FinancialLineItemSource|^|IsCombinedItem|^|IsExcludedFromStandardization|^|DocByteOffset|^|DocByteLength|^|BookMark|^|ItemDisplayedNegativeFlag|^|ItemScalingFactor|^|ItemDisplayedValue|^|ReportedValue|^|EditedDescription|^|EditedDescription.languageId|^|ReportedDescription|^|ReportedDescription.languageId|^|AsReportedInstanceSequence|^|PhysicalMeasureId|^|FinancialStatementLineItemSequence|^|SystemDerivedTypeCode|^|AsReportedExchangeRate|^|AsReportedExchangeRateSourceCurrencyId|^|ThirdPartySourceCode|^|FinancialStatementLineItemValueUpperRange|^|FinancialStatementLineItemLocalLanguageLabel|^|FinancialStatementLineItemLocalLanguageLabel.languageId|^|IsFinal|^|FinancialStatementLineItem.lineItemInstanceKey|^|StatementSectionIsCredit|^|CapitalChangeAdjustmentDate|^|ParentLineItemId|^|EstimateMethodId|^|StatementSectionId|^|SystemDerivedTypeCodeId|^|UnitEnumerationId|^|FiscalYear|^|IsAnnual|^|PeriodPermId|^|PeriodPermId.objectTypeId|^|PeriodPermId.objectType|^|AuditID|^|AsReportedItemId|^|ExpressionInstanceId|^|ExpressionText|^|FFAction|!|
192730239205|^|235|^|1|^|FTN|^|500186|^|221|^|Average Age of Employees|^|505074|^|30.00000|^||^||^|False|^|1.00000|^|False|^|EMP|^||^|False|^|ARV|^||^|505074|^||^|False|^|False|^||^||^||^||^|0|^||^||^||^|505074|^||^|505074|^||^||^|122880|^|NA|^||^||^|TK |^||^||^|505126|^|True|^|1235002211206722736|^|True|^||^||^|3019656|^|3013652|^|3019679|^|1010066|^|1976|^|True|^||^|1000220295|^||^||^||^||^||^|I|!|
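The ArrayIndexOutOfBoundsException: 37 hints that at least one row splits into 37 or fewer fields, while the schema expects one field per header column. rdd.count succeeds because it only counts raw lines; data.count() fails because each Row is encoded against the schema only when the action runs. A minimal diagnostic sketch (reusing the rdd and header values from the code above; the variable names below are mine) to count such rows before trying a fix:

val expectedFields = header.length
val mismatchedRows = rdd
  .filter(!_.contains("uniqueFundamentalSet"))
  .map(_.split("\\|\\^\\|").length)
  .filter(_ != expectedFields)
  .count()
println(s"Rows whose field count differs from the header: $mismatchedRows")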
Answer (score: 3)
Filter out the mismatched rows
One of the simplest ways is to filter out every row whose length does not match the schema length before applying the schema to build the DataFrame:
val requiredNumberOfFields = schema.fieldNames.length // the number of columns the schema requires
val data = sqlContext.createDataFrame(
  rdd
    .filter(!_.contains("uniqueFundamentalSet"))
    .map(line => line.split("\\|\\^\\|"))
    .filter(_.length == requiredNumberOfFields) // keep only rows with exactly the number of fields the schema requires
    .map(x => Row.fromSeq(x.toSeq)),
  schema
)
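If you want to see what is being discarded rather than silently dropping it, a small sketch (same variable names as above) can pull a few of the mismatched lines to the driver for inspection:

val dropped = rdd
  .filter(!_.contains("uniqueFundamentalSet"))
  .map(line => line.split("\\|\\^\\|"))
  .filter(_.length != requiredNumberOfFields)
  .take(5)
dropped.foreach(fields => println(s"${fields.length} fields: ${fields.mkString("|").take(100)}"))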
Add dummy strings or drop the extra ones
You can write a function that checks the length: if a row has fewer fields than the schema, append dummy strings; if it has more, drop the extra fields.
val requiredNumberOfFields = schema.fieldNames.length

// Pads short rows with "dummy" values and truncates long rows so that
// every row ends up with exactly `len` fields.
def appendDummyData(row: Array[String], len: Int): Array[String] =
  if (row.length == len) row
  else if (row.length < len) row ++ Array.fill(len - row.length)("dummy")
  else row.take(len)
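For example, the function pads a short row and trims a long one (illustrative values, not from the dataset):

appendDummyData(Array("a", "b", "c"), 5)           // Array(a, b, c, dummy, dummy)
appendDummyData(Array("a", "b", "c", "d", "e"), 3) // Array(a, b, c)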
val data = sqlContext.createDataFrame(
  rdd
    .filter(!_.contains("uniqueFundamentalSet"))
    .map(line => line.split("\\|\\^\\|"))
    .map(x => Row.fromSeq(appendDummyData(x, requiredNumberOfFields).toSeq)), // normalize each row's length before building the Row
  schema
)
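One more thing worth checking, as an assumption about the root cause rather than part of the original answer: Scala's String.split (like Java's) drops trailing empty strings when called without a limit. In a dataset like this one, where many fields can be empty, any line that happens to end with empty fields will split into fewer tokens than the schema expects. Passing a negative limit preserves them, which may avoid the need to filter or pad at all:

"a|^|b|^||^|".split("\\|\\^\\|").length     // 2 -- trailing empty fields dropped
"a|^|b|^||^|".split("\\|\\^\\|", -1).length // 4 -- trailing empty fields kept
val rows = rdd.map(line => line.split("\\|\\^\\|", -1))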
I hope the answer is helpful.