I am trying to read a large file in Spark with Scala and then perform a join. When I test with small files it works fine, but with larger files I get errors. I managed to isolate one file that produces the error. The file is 1 GB in size, and the error is thrown at the end, while creating the partitions, where I split the file to get the columns.
The error is thrown after these lines:
val rdd = sc.textFile(mainFileURL)
val header = rdd.filter(_.contains("uniqueFundamentalSet")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
println(schema)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("uniqueFundamentalSet")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
This is the culprit:
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("uniqueFundamentalSet")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
Please suggest how I can handle this.
When I do rdd.count I get a value, but when I do data.count() I get this error:
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 37
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, uniqueFundamentalSet), StringType), true) AS uniqueFundamentalSet#0
Here is my sample dataset:
uniqueFundamentalSet|^|PeriodId|^|SourceId|^|StatementTypeCode|^|StatementCurrencyId|^|FinancialStatementLineItem.lineItemId|^|FinancialAsReportedLineItemName|^|FinancialAsReportedLineItemName.languageId|^|FinancialStatementLineItemValue|^|AdjustedForCorporateActionValue|^|ReportedCurrencyId|^|IsAsReportedCurrencySetManually|^|Unit|^|IsTotal|^|StatementSectionCode|^|DimentionalLineItemId|^|IsDerived|^|EstimateMethodCode|^|EstimateMethodNote|^|EstimateMethodNote.languageId|^|FinancialLineItemSource|^|IsCombinedItem|^|IsExcludedFromStandardization|^|DocByteOffset|^|DocByteLength|^|BookMark|^|ItemDisplayedNegativeFlag|^|ItemScalingFactor|^|ItemDisplayedValue|^|ReportedValue|^|EditedDescription|^|EditedDescription.languageId|^|ReportedDescription|^|ReportedDescription.languageId|^|AsReportedInstanceSequence|^|PhysicalMeasureId|^|FinancialStatementLineItemSequence|^|SystemDerivedTypeCode|^|AsReportedExchangeRate|^|AsReportedExchangeRateSourceCurrencyId|^|ThirdPartySourceCode|^|FinancialStatementLineItemValueUpperRange|^|FinancialStatementLineItemLocalLanguageLabel|^|FinancialStatementLineItemLocalLanguageLabel.languageId|^|IsFinal|^|FinancialStatementLineItem.lineItemInstanceKey|^|StatementSectionIsCredit|^|CapitalChangeAdjustmentDate|^|ParentLineItemId|^|EstimateMethodId|^|StatementSectionId|^|SystemDerivedTypeCodeId|^|UnitEnumerationId|^|FiscalYear|^|IsAnnual|^|PeriodPermId|^|PeriodPermId.objectTypeId|^|PeriodPermId.objectType|^|AuditID|^|AsReportedItemId|^|ExpressionInstanceId|^|ExpressionText|^|FFAction|!|
192730239205|^|235|^|1|^|FTN|^|500186|^|221|^|Average Age of Employees|^|505074|^|30.00000|^||^||^|False|^|1.00000|^|False|^|EMP|^||^|False|^|ARV|^||^|505074|^||^|False|^|False|^||^||^||^||^|0|^||^||^||^|505074|^||^|505074|^||^||^|122880|^|NA|^||^||^|TK |^||^||^|505126|^|True|^|1235002211206722736|^|True|^||^||^|3019656|^|3013652|^|3019679|^|1010066|^|1976|^|True|^||^|1000220295|^||^||^||^||^||^|I|!|
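The ArrayIndexOutOfBoundsException: 37 hints that at least one row splits into 37 or fewer fields, while the schema expects one field per header column. rdd.count succeeds because it only counts raw lines; data.count() fails because each Row is encoded against the schema only when the action runs. A minimal diagnostic sketch (reusing the rdd and header values from the code above; the variable names below are mine) to count such rows before trying a fix:

val expectedFields = header.length
val mismatchedRows = rdd
  .filter(!_.contains("uniqueFundamentalSet"))
  .map(_.split("\\|\\^\\|").length)
  .filter(_ != expectedFields)
  .count()
println(s"Rows whose field count differs from the header: $mismatchedRows")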
Answer (score: 3)
Filter out the mismatched rows
One of the simplest ways is to filter out every row whose length does not match the schema length before applying the schema to build the DataFrame:
val requiredNumberOfFields = schema.fieldNames.length // the number of columns the schema requires
val data = sqlContext.createDataFrame(
  rdd
    .filter(!_.contains("uniqueFundamentalSet"))
    .map(line => line.split("\\|\\^\\|"))
    .filter(_.length == requiredNumberOfFields) // keep only rows with exactly the number of fields the schema requires
    .map(x => Row.fromSeq(x.toSeq)),
  schema
)
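If you want to see what is being discarded rather than silently dropping it, a small sketch (same variable names as above) can pull a few of the mismatched lines to the driver for inspection:

val dropped = rdd
  .filter(!_.contains("uniqueFundamentalSet"))
  .map(line => line.split("\\|\\^\\|"))
  .filter(_.length != requiredNumberOfFields)
  .take(5)
dropped.foreach(fields => println(s"${fields.length} fields: ${fields.mkString("|").take(100)}"))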
Add dummy strings or drop the extra ones
You can write a function that checks the length: if a row has fewer fields than the schema, append dummy strings; if it has more, drop the extra fields.
val requiredNumberOfFields = schema.fieldNames.length

// Pads short rows with "dummy" values and truncates long rows so that
// every row ends up with exactly `len` fields.
def appendDummyData(row: Array[String], len: Int): Array[String] =
  if (row.length == len) row
  else if (row.length < len) row ++ Array.fill(len - row.length)("dummy")
  else row.take(len)
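For example, the function pads a short row and trims a long one (illustrative values, not from the dataset):

appendDummyData(Array("a", "b", "c"), 5)           // Array(a, b, c, dummy, dummy)
appendDummyData(Array("a", "b", "c", "d", "e"), 3) // Array(a, b, c)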
val data = sqlContext.createDataFrame(
  rdd
    .filter(!_.contains("uniqueFundamentalSet"))
    .map(line => line.split("\\|\\^\\|"))
    .map(x => Row.fromSeq(appendDummyData(x, requiredNumberOfFields).toSeq)), // normalize each row's length before building the Row
  schema
)
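One more thing worth checking, as an assumption about the root cause rather than part of the original answer: Scala's String.split (like Java's) drops trailing empty strings when called without a limit. In a dataset like this one, where many fields can be empty, any line that happens to end with empty fields will split into fewer tokens than the schema expects. Passing a negative limit preserves them, which may avoid the need to filter or pad at all:

"a|^|b|^||^|".split("\\|\\^\\|").length     // 2 -- trailing empty fields dropped
"a|^|b|^||^|".split("\\|\\^\\|", -1).length // 4 -- trailing empty fields kept
val rows = rdd.map(line => line.split("\\|\\^\\|", -1))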
I hope the answer is helpful.