How do I avoid a ParseException when writing ORC files with PySpark?

Asked: 2019-03-29 21:14:52

Tags: python python-3.x pyspark aws-glue

I am reading two dataframes that are stored in ORC format (the files are already ORC). I then write the first dataframe to another S3 path, again as ORC. The write raises a ParseException whose message includes the data's column names.

I looked through some related questions, but their answers are all Hive-specific; none of them address a pure PySpark setup.

import sys
from datetime import datetime

from athena_processing import AthenaContext
from awsglue.utils import getResolvedOptions

athena_context = AthenaContext()
# spark context is already created for you as athena_context.sc, spark session is already created for you as athena_context.spark
# include this line in your code: spark = athena_context.spark

spark = athena_context.spark

# get two input names


# get output name
#output_path = getResolvedOptions(sys.argv, ['output_path'])['output_path']

# df = athena_context.input('S3')
# df2 = athena_context.input('S3_1')
df = spark.read.orc('s3://my-bucket/pipelines/march302_AkdmcrCTp/manual__2019-03-29T20_19_22+00_00/aws-s3_8LI6nrCTp')
df2 = spark.read.orc('s3://my-bucket/pipelines/march302_AkdmcrCTp/manual__2019-03-29T20_19_22+00_00/aws-s3_-K5Rn9CHM')
# df = spark.read.csv(
#     's3://my-bucket/one_billion_hundred_columns.csv',
#     header=True
# )
# print("First dataframe read with headers set to True")

# df2 = athena_context.input('5c9e2249186c5d0006b36fae_manual__2019_03_29T13_52_15_00_00')
# df2 = spark.read.orc(
#     's3://my-bucket/opendata/gdelt/gdelt_100m'
# )

# df2 = spark.read.csv(
#     's3://my-bucket/one_billion_hundred_columns.csv',
#     header=True
# )

print("Second dataframe read with headers set to True")

# Obtain columns lists
left_cols = df.columns
right_cols = df2.columns

# Prefix each dataframe's fields with "left_" / "right_" to disambiguate them.
# Note: selectExpr parses each string as a SQL expression, so any column name
# that contains a space (e.g. "item type") must be backtick-quoted or the
# expression will fail to parse.
df = df.selectExpr([col + ' as left_' + col for col in left_cols])
df2 = df2.selectExpr([col + ' as right_' + col for col in right_cols])

# Perform join
# df3 = df.alias('l').join(df2.alias('r'), on='l.left_c_0' == 'r.right_c_0')

# df3 = df.alias('l').join(df2.alias('r'), on='c_0')

# df3 = df.join(
#     df2,
#     df["left_column_test_0"] == df2["right_column_test_0"]
# )

# df3 = df.join(
#     df2,
#     df["left__c0"] == df2["right__c0"]
# )

# df3 = df3.repartition("left_column_test_0")

print("Dataframes have been joined and the resultant dataframe has been partitioned successfully.")

# output_file_path = 's3://my-bucket/temp-de-avi/output_temp_de_avi/gdelt_100m/march_26_job/_1'
output_file_path = 's3://my-bucket/pipelines/march302_AkdmcrCTp/manual__2019-03-29T20_19_22+00_00/processing_1wuRc9CHp'
print(output_file_path)
df.write.orc(
    output_file_path
)


# print("Dataframe has been written to csv.")
athena_context.commit()

I expected the ORC file to be written successfully, but instead I get the following error from AWS Glue:

ParseException: u"\nextraneous input 'type' expecting ':'(line 1, pos 41)\n\n== SQL ==\nstruct<region:string,country:string,item type:string,sales channel:string,order priority:string,order date:timestamp,order id:int,ship date:timestamp,units sold:int,unit price:double,unit cost:double,total revenue:double,total cost:double,total profit:double>\n-----------------------------------------^^^\n"
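For context: the struct<...> string in the error is the schema Spark is trying to parse, and it contains field names with spaces (item type, sales channel, order date, ...); the ^^^ marker sits exactly under item type, which Spark's SQL/DDL parser rejects. Assuming the dataframes themselves load and the failure comes from the space-containing names reaching a SQL parser (for example through the selectExpr rename above), one minimal workaround sketch is to rename the columns through toDF, which takes plain strings and never parses SQL; the replace-spaces rule and the left_/right_ prefixes below follow the question's intent but are otherwise my assumption:

# Sketch only: produce space-free column names without going through
# selectExpr's SQL parsing. toDF(*names) accepts plain strings, so a
# name like "item type" never reaches the parser.
df = df.toDF(*['left_' + c.replace(' ', '_') for c in df.columns])
df2 = df2.toDF(*['right_' + c.replace(' ', '_') for c in df2.columns])

df.write.orc(output_file_path)

If the ParseException instead fires while Spark is still reading the ORC schema itself (before any rename can run), the column names would have to be sanitized in whatever process wrote the source files.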

0 Answers:

No answers yet.