Question

I am using Databricks and trying to read a CSV file like this:

df = (spark.read      
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path_to_my_file)
)

I get the error:

AnalysisException: 'Unable to infer schema for CSV. It must be specified manually.;'

I checked that my file is not empty, and I also tried specifying the schema myself, like this:

schema = "datetime timestamp, id STRING, zone_id STRING, name INT, time INT, a INT"
df = (spark.read      
  .option("header", "true")
  .schema(schema)
  .csv(path_to_my_file)
)

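But when I try to view it with display(df), it only gives me the output below. I am completely lost and don't know what to do.

df.show() and df.printSchema() give the following:

It looks like no data is being read into the DataFrame.

Error snapshot: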

Answer 1

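Note that this is an incomplete answer, since there isn't enough information about what your file looks like to understand why inferSchema isn't working. I'm posting this as an answer because it is too long for a comment.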
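Since the missing piece is what the file actually looks like, here is a minimal sketch for peeking at the raw header and first rows with plain Python, outside Spark. The helper name `peek_csv` and its parameters are hypothetical, and it assumes the file can be opened through an ordinary local path (on Databricks this often means a `/dbfs/...` path, which is an assumption about your setup).

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5, encoding="utf-8"):
    """Return (header, rows): the first line and up to n_rows rows after it.

    Hypothetical helper for illustration; it assumes a comma-delimited
    file that is readable through a local file path.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        # header is None when the file has no lines at all
        header = next(reader, None)
        rows = list(islice(reader, n_rows))
    return header, rows
```

If `header` comes back as `None`, the file is empty. If it comes back as a single long field, the delimiter is probably not a comma.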

That said, to specify the schema programmatically, you define it with StructType().

Using your example "datetime timestamp, id STRING, zone_id STRING, name INT, time INT, mod_a INT"

it would look like this:

# Import the data types used in the schema
from pyspark.sql.types import (
    IntegerType, StringType, StructField, StructType, TimestampType
)

# Each StructField is (column name, data type, nullable)
schema = StructType([
    StructField('datetime', TimestampType(), True),
    StructField('id', StringType(), True),
    StructField('zone_id', StringType(), True),
    StructField('name', IntegerType(), True),
    StructField('time', IntegerType(), True),
    StructField('mod_a', IntegerType(), True),
])
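As a rough sketch of how such a schema could then be handed to the reader: the helper name `read_csv_with_schema` is hypothetical, `spark` is assumed to be an existing SparkSession, and `path` stands in for `path_to_my_file` from the question. Setting `mode` to `FAILFAST` is my own addition for debugging, not something from the question.

```python
def read_csv_with_schema(spark, path, schema):
    """Read a CSV with an explicit schema instead of inferSchema.

    Hypothetical helper: `spark` is assumed to be an active SparkSession
    and `schema` a StructType (or a DDL string, as in the question).
    """
    return (
        spark.read
        .option("header", "true")
        # FAILFAST raises on the first row that does not fit the schema,
        # rather than silently filling unparseable fields with null.
        .option("mode", "FAILFAST")
        .schema(schema)
        .csv(path)
    )
```

With the default `PERMISSIVE` mode, a field that cannot be parsed as its declared type comes back as null, which can make it look as if nothing was read. Failing fast may help show whether a declared type (for example `name` as an integer) does not match the data.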

Note how df.printSchema() reports every column as data type string.

Unable to infer schema for CSV in pyspark
