Question

I want to parse an Excel file. A few of its fields hold values in the timestamp format ("dd-MMM-yy hh:mm:ss:SSSSSSSSS aa"). I have defined the field type as timestamp, but my application does not recognize the data type and fails to load the data. If I use StringType as the data type instead, the file parses, but I do not want to rely on that workaround, so I am looking for a proper solution. My code is as follows:

    ReadExcel("C:path\to\the\raw_file\Consignments.xlsx", "A1", MySchema, spark, "dd-MM-yyyy", "dd-MMM-yy hh:mm:ss:SSSSSSSSS aa")

    def ReadExcel(path: String,
                  dataAddress: String = "A2",
                  Schema: StructType,
                  spark: org.apache.spark.sql.SparkSession,
                  datefmt: String = "dd-MM-yyyy",
                  tsfmt: String = "dd-MM-yyyy HH:mm:ss"): DataFrame = {
      /**
       * Though Crealytics accept TimestampFormat Only
       * You can Create CustomSchema with DateType and Date values in data will be typed to Date
       */
      cleanHeaders(spark.read
        .format("com.crealytics.spark.excel")
        .option("dataAddress", dataAddress)
        .option("useHeader", "false") // Required
        .option("treatEmptyValuesAsNulls", "true") // Optional, default: true
        .option("inferSchema", "false") // Optional, default: false
        .option("addColorColumns", "false") // Optional, default: false
        .option("timestampFormat", "dd-MM-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
        //.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
        //.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
        .schema(Schema)
        .load(path))
    }

Note: I am using a Databricks notebook and the crealytics library to read the Excel file.

Answer 1

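@金星: I think the timestamp representation is incorrect. 873000000 milliseconds converts to more than 10 days. I think you only need to treat the first 3 digits of the fraction as milliseconds. Please check.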
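The arithmetic behind the "more than 10 days" remark can be checked with a few lines of plain Scala. This is only a sketch: the object and value names are made up for illustration, and it assumes the nine-digit fraction is being read as a millisecond count, as the answer suggests.

```scala
import java.util.concurrent.TimeUnit

object FractionAsMillis {
  // The nine-digit fraction from the answer, read as a plain millisecond count.
  val fractionAsMillis: Long = 873000000L

  // Whole days contained in that many milliseconds.
  val wholeDays: Long = TimeUnit.MILLISECONDS.toDays(fractionAsMillis)

  // The same quantity as a fractional number of days: 873000000 / 86400000.
  val days: Double = fractionAsMillis.toDouble / TimeUnit.DAYS.toMillis(1)

  def main(args: Array[String]): Unit =
    println(s"$fractionAsMillis ms is $days days ($wholeDays whole days)")
}
```

Since one day is 86400000 ms, the value comes out slightly above 10 days, whereas the first three digits alone (873) stay below one second.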

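In that case, you can proceed as follows:

First read the file, then take a substring of the timestamp column so that only the first 3 digits of the fraction (the milliseconds) remain.
Then use a Spark cast, typically inside the withColumn method, with from_unixtime(unix_timestamp(column, 'timestamp format'), 'format').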
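The two steps above could look roughly like the following in Spark (Scala). This is a minimal sketch, not the answerer's exact code: it assumes the column was first loaded as a string, that spark-sql is on the classpath, and that the trimmed value matches the pattern "dd-MMM-yy hh:mm:ss:SSS a". The object name, the column names (`created_ts`, `ts_trimmed`, `parsed_ts`) and the sample value are hypothetical, since the question's real data is not shown.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_unixtime, regexp_replace, unix_timestamp}

object TrimAndParseTimestamp {
  // Assumed pattern for the trimmed value: three fractional digits instead of nine.
  val trimmedFormat = "dd-MMM-yy hh:mm:ss:SSS a"

  /** Adds a `parsed_ts` timestamp column derived from the string column `rawCol`. */
  def withParsedTimestamp(df: DataFrame, rawCol: String): DataFrame =
    df
      // Step 1: keep only the first 3 of the 9 fractional digits.
      .withColumn("ts_trimmed", regexp_replace(col(rawCol), "(:\\d{3})\\d{6}", "$1"))
      // Step 2: parse to epoch seconds, render in the default layout, cast to timestamp.
      .withColumn(
        "parsed_ts",
        from_unixtime(unix_timestamp(col("ts_trimmed"), trimmedFormat), "yyyy-MM-dd HH:mm:ss")
          .cast("timestamp"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("trim-and-parse").getOrCreate()
    import spark.implicits._

    // Hypothetical sample value in the question's format.
    val df = Seq("17-Oct-18 10:15:30:873000000 AM").toDF("created_ts")

    withParsedTimestamp(df, "created_ts").show(truncate = false)
    spark.stop()
  }
}
```

Note that unix_timestamp works in whole seconds, so the fractional part does not survive this round trip; if the milliseconds matter, they would have to be carried separately.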

Spark (Scala): parsing a field with the timestamp format ("dd-MMM-yy hh:mm:ss:SSSSSSSSS aa")
