Question

This loads all the data from all the files into one combined DataFrame.

df = sqlContext.read.format(
  'com.databricks.spark.csv'
).options(
  header='false',
  schema = customSchema
).load(fullPath)

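fullPath is a concatenation of several different strings. Anyway, I thought I could add the file name inside the sqlContext call, but that did not work. It gives me an error.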

df = sqlContext.read.format(
  'com.databricks.spark.csv'
).options(
  header='false',
  schema = customSchema,
  withColumn(
    "filename",
    input_file_name()
  )
).load(fullPath)

How can I load everything from multiple datasets together with the file names?

This is the error message:

SyntaxError: unexpected EOF while parsing
  File "<command-540264511625083>", line 43
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', schema = customSchema, withColumn("filename", input_file_name()).load(fullPath)
                                                                                                                                                                    ^
SyntaxError: unexpected EOF while parsing

Answer 1

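Oh, I see how it works now. withColumn is a DataFrame method, not a reader option, so it goes at the end, after load(). This is what worked for me.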

df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', schema = customSchema).load(fullPath).withColumn("filename",input_file_name())

Also, you need to add the correct import at the top.

from pyspark.sql.functions import input_file_name
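
For anyone on Spark 2.0 or later, here is a minimal sketch of the same idea using the built-in CSV reader instead of the com.databricks.spark.csv package. The helper name load_csv_with_filename and the spark argument (assumed to be a SparkSession) are made up for illustration and are not from the code above. It passes the schema through the schema keyword of csv() rather than through options(). The import sits inside the function only so the snippet can be loaded without Spark installed; putting it at the top of the file works just as well.

```python
def load_csv_with_filename(spark, path, schema):
    """Read headerless CSV files and tag each row with its source file.

    Assumptions (hypothetical helper, not part of the original post):
      - spark is a pyspark.sql.SparkSession (Spark 2.0+)
      - path may be a single file, a directory, or a glob pattern
      - schema is a StructType such as customSchema
    """
    # Imported here so the module loads even where pyspark is missing.
    from pyspark.sql.functions import input_file_name

    # The reader returns a DataFrame; header and schema are reader settings.
    df = spark.read.csv(path, header=False, schema=schema)

    # withColumn is called on the DataFrame, after the load has been defined.
    return df.withColumn("filename", input_file_name())
```

Under those assumptions, load_csv_with_filename(spark, fullPath, customSchema) should return the combined DataFrame with an extra filename column holding the path of the file each row came from.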
