I need to read a 10 GB fixed-width file into a data frame. How can I do this with Spark from R?
Suppose my text data looks like this:
text <- c("0001BRAjonh   ",
          "0002USAmarina ",
          "0003GBPcharles")
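I want the first 4 characters to go into the data frame's "ID" column, characters 5-7 into the "Country" column, and characters 8-14 into the "Name" column.
If the dataset were small I would use the read.fwf function, but that is not the case.
I can read it as a text file with sparklyr::spark_read_text, but I don't know how to assign the file's values to the data frame's columns correctly.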
Answer 0 (score: -1)
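Edit: I forgot to mention that, for whatever reason, substring positions start at 1 while array indexes start at 0.
I went back through and added the code for the approach I described above.
The process is dynamic and driven by a Hive table named Input_Table. The table has 5 columns: Table_Name, Column_Name, Column_Ordinal_Position, Column_Start, and Column_Length. It is an external table, so any user can change, remove, or drop files into its folder location. I wrote this quickly from scratch rather than copying real code, so does it all make sense?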
// Needed for the $"col" syntax used below.
import spark.implicits._
// Load the input file and the Hive metadata. recordFormat, folderLocation, tableName and tableFormat are assumed to be defined by the caller.
// Each input line is kept as a single string column. For the Hive table we keep only the rows for this table, sorted into column order.
val inputDF = spark.read.format(recordFormat).option("header", "false").load(folderLocation + "/" + tableName + "." + tableFormat).toDF("Odd_Long_Name")
val inputSchemaDF = spark.sql("select * from Input_Table where Table_Name = '" + tableName + "'").sort($"Column_Ordinal_Position")
// Build one array per metadata column. Going through rdd, map and collect turns a DataFrame column into an array of strings that I can iterate over.
// The start column is named Column_Start here to match the table description above.
val columnNameArray = inputSchemaDF.selectExpr("Column_Name").rdd.map(x => x.mkString).collect
val columnStartArray = inputSchemaDF.selectExpr("Column_Start").rdd.map(x => x.mkString).collect
val columnLengthArray = inputSchemaDF.selectExpr("Column_Length").rdd.map(x => x.mkString).collect
// Create the iterator and the other variables that are meant to be overwritten.
var columnAllocationIterator = 1
var localCommand = ""
var commandArray = Array.empty[String]
// Loop once per column in the input table.
while (columnAllocationIterator <= columnNameArray.length) {
  // Overwrite the string command with the new one; I thought Odd_Long_Name was too accurate not to put in the code.
  // The iterator is 1-based, so subtract 1 when indexing the 0-based arrays.
  localCommand = "substring(Odd_Long_Name, " + columnStartArray(columnAllocationIterator - 1) + ", " + columnLengthArray(columnAllocationIterator - 1) + ") as " + columnNameArray(columnAllocationIterator - 1)
  // Append the command; starting from an empty array means the first pass needs no special case.
  commandArray = commandArray :+ localCommand
  // I really like iterating my iterators like this.
  columnAllocationIterator = columnAllocationIterator + 1
}
// Run every element of the string array independently against the table, one output column per expression.
val finalDF = inputDF.selectExpr(commandArray: _*)
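To make the idea concrete for the sample in the question, here is a minimal sketch in plain Scala with no Spark dependency. It hard-codes the layout instead of reading it from Input_Table. The names Field, FixedWidthSketch, layout, selectExpressions and slice are hypothetical and only for illustration, and the English column names ID, Country and Name are assumed from the question. It builds the same kind of substring expressions as the loop above, and also slices the sample lines locally to show the 1-based versus 0-based offset.

```scala
// Hypothetical description of one fixed-width field.
// start is 1-based, matching Spark SQL's substring(str, pos, len).
final case class Field(name: String, start: Int, length: Int)

object FixedWidthSketch {
  // Assumed layout for the question's sample: 4 + 3 + 7 = 14 characters per line.
  val layout: Seq[Field] = Seq(
    Field("ID", 1, 4),
    Field("Country", 5, 3),
    Field("Name", 8, 7)
  )

  // One "substring(...) as <name>" expression per field, in layout order.
  def selectExpressions(lineColumn: String, fields: Seq[Field]): Seq[String] =
    fields.map(f => s"substring($lineColumn, ${f.start}, ${f.length}) as ${f.name}")

  // Local equivalent for a single line. String.substring is 0-based with an
  // exclusive end, so the 1-based start is shifted by one. Short lines are
  // clamped instead of throwing, and padding spaces are trimmed.
  def slice(line: String, f: Field): String = {
    val from = math.min(f.start - 1, line.length)
    val until = math.min(from + f.length, line.length)
    line.substring(from, until).trim
  }

  def main(args: Array[String]): Unit = {
    selectExpressions("Odd_Long_Name", layout).foreach(println)
    val sample = Seq("0001BRAjonh   ", "0002USAmarina ", "0003GBPcharles")
    sample.foreach(line => println(layout.map(f => slice(line, f)).mkString(",")))
  }
}
```

In Spark, the expressions could then be applied with something like inputDF.selectExpr(FixedWidthSketch.selectExpressions("Odd_Long_Name", FixedWidthSketch.layout): _*), assuming inputDF holds one line per row in a column named Odd_Long_Name as in the code above.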