I have a dataframe df that looks like this:
+--------+--------------------+--------+------+
| id| path|somestff| hash1|
+--------+--------------------+--------+------+
| 1|/file/dirA/fileA.txt| 58| 65161|
| 2|/file/dirB/fileB.txt| 52| 65913|
| 3|/file/dirC/fileC.txt| 99|131073|
| 4|/file/dirF/fileD.txt| 46|196233|
+--------+--------------------+--------+------+
One note: the /file/dir parts differ; not all files are stored in the same directory. In fact there are hundreds of files spread across various directories.
What I want to accomplish here is to read each file referenced in the path column, count the records in that file, and write the row count into a new column of the dataframe.
I tried the following function and UDF:
def executeRowCount(fileCount: String): Long = {
  val rowCount = spark.read.format("csv").option("header", "false").load(fileCount).count
  rowCount
}
val execUdf = udf(executeRowCount _)
df.withColumn("row_count", execUdf (col("path"))).show()
This results in the following error:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => bigint)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:28)
at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:25)
... 19 more
I also tried iterating over the column, like this:
val te = df.select("path").as[String].collect()
te.foreach(executeRowCount)
which works fine here, but I want to store the result in the df...
I have tried several solutions, but I am facing a dead end here.
Answer 0 (score: 2)
This does not work because dataframes can only be created in the driver JVM, while the UDF code runs in the executor JVMs. What you can do instead is load the CSVs into a separate dataframe and enrich the data with a filename column:
val csvs = spark
  .read
  .format("csv")
  .load("/file/dir/")
  .withColumn("filename", input_file_name())
and then join the original df on the filename column.
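A minimal sketch of what that could look like is shown below; the wildcard load path, the URI normalization, and the column names clean_path / row_count are assumptions for illustration, not part of the answer:

import org.apache.spark.sql.functions.{col, count, input_file_name, regexp_replace}

// Load every CSV below the top-level directory (assumed glob) and tag each row
// with the file it was read from.
val csvs = spark
  .read
  .format("csv")
  .option("header", "false")
  .load("/file/*/")
  .withColumn("filename", input_file_name())

// input_file_name() usually returns a full URI (e.g. "file:///file/dirA/fileA.txt"),
// so strip the scheme prefix before comparing it to the plain path column (assumption).
val counts = csvs
  .withColumn("clean_path", regexp_replace(col("filename"), "^file:(//)?", ""))
  .groupBy("clean_path")
  .agg(count("clean_path").as("row_count"))

// Enrich the original df with the per-file row counts.
val result = df.join(counts, df("path") === counts("clean_path"), "left")
result.show()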
Answer 1 (score: 0)
What about this?:
def executeRowCount = udf((fileCount: String) => {
  spark.read.format("csv").option("header", "false").load(fileCount).count
})
df.withColumn("row_count", executeRowCount(col("path"))).show()
Answer 2 (score: 0)
Maybe something like this?
sqlContext
  .read
  .format("csv")
  .load("/tmp/input/")
  .withColumn("filename", input_file_name())
  .groupBy("filename")
  .agg(count("filename").as("record_count"))
  .show()
Answer 3 (score: 0)
I solved the problem in the following way:
val queue = df.select("path").as[String].collect()
val countResult = for (item <- queue) yield {
  val rowCount = (item, spark.read.format("csv").option("header", "false").load(item).count)
  rowCount
}
val df2 = spark.createDataFrame(countResult)
Then I joined df with df2...
The problem was exactly what @ollik1 mentioned about the driver/worker architecture of UDFs: the UDF cannot be serialized, because I would need to use the spark.read function inside it.
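For completeness, a small sketch of how that last join might look; the column names "path" and "row_count" given to df2 are assumptions, since the post does not show them:

// Give the collected (path, count) columns explicit names so the join key matches
// the original dataframe; the names "path" and "row_count" are assumed here.
val df2Named = df2.toDF("path", "row_count")

// Join the per-file counts back onto the original dataframe.
val joined = df.join(df2Named, Seq("path"), "left")
joined.show()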