Question

I am testing this code.

from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)


customSchema = StructType([
    StructField("id", StringType(), True),
    StructField("date", StringType(), True),
    # etc., etc., etc.
    StructField("filename", StringType(), True)])



fullPath = "path_and_credentials_here"
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', schema = customSchema, delimiter='|').load(fullPath).withColumn("filename",input_file_name())

df.show()

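My data is pipe-delimited, and the first line contains some metadata, which is also pipe-delimited. Strangely, the custom schema is ignored entirely. Instead of my custom schema being applied, the metadata in the first line of the file determines the schema, which is completely wrong. This is the output I see.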

+------------------+----------+------------+---------+--------------------+
|               _c0|       _c1|         _c2|      _c3|            filename|
+------------------+----------+------------+---------+--------------------+
|                CP|  20190628|    22:41:58|   001586|   abfss://rawdat...|
|          asset_id|price_date|price_source|bid_value|   abfss://rawdat...|
|             2e58f|  20190628|         CPN|  108.375|   abfss://rawdat...|
|             2e58f|  20190628|         FNR|     null|   abfss://rawdat...|

etc., etc., etc.

How can I get the custom schema applied?

Answer 1

The problem occurs because you are using the old (no longer maintained) CSV reader. See the disclaimer right below the title of the package.

If you try the new format instead, it will work.
