Question

I am reading sheets from an Excel file with pandas, and I need to store the data as JSON in HDFS. For some sheets I get an exception.

import pandas as pd

excel_file = pd.ExcelFile("export_n_moreExportData10846.xls")
for sheet_name in excel_file.sheet_names:
    df = pd.read_excel(excel_file, header=None, squeeze=True, sheet_name=sheet_name)
    if sheet_name == 'Passed':
        print('**************' + sheet_name + '******************')
        # Use the first row as the header and the remaining rows as the data
        for i, row in df.iterrows():
            data = df.iloc[(i+1):].reset_index(drop=True)
            data.columns = pd.Series(list(df.iloc[i])).str.replace(' ','_')
            break

        for c in data.columns:
            data[c] = pd.to_numeric(data[c], errors='ignore')
        print(data)  # I'm able to print the data

        result1 = sparkSession.createDataFrame(data)  # Facing the exception here
        print("inserting data into HDFS...")
        result1.write.mode("append").json(hdfsPath)
        print("inserted data into hdfs")

I get the following exception:

raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

[Image showing the data]

Answer 1

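This is probably because some columns hold values of different types within the same column. pandas can handle that (it stores them with the 'object' dtype), but a Spark DataFrame cannot.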
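To check whether this is the cause, it can help to list the columns that contain more than one Python type. The following is only a minimal sketch: the helper name mixed_type_columns and the sample frame are made up for illustration and are not part of the question's data. An empty Excel cell typically arrives as NaN, which is a float, so a text column with gaps shows up as a mix of str and float.

```python
import pandas as pd


def mixed_type_columns(df):
    """Return {column: sorted type names} for columns holding more than one Python type."""
    mixed = {}
    for col in df.columns:
        type_names = sorted({type(v).__name__ for v in df[col]})
        if len(type_names) > 1:
            mixed[col] = type_names
    return mixed


# Hypothetical sample: "value" mixes a string, a float and a missing cell (NaN)
data = pd.DataFrame({
    "id": pd.Series([1, 2, 3]),
    "value": pd.Series(["abc", 1.5, float("nan")], dtype=object),
})

report = mixed_type_columns(data)
print(report)
```

Any column that appears in the report is a candidate for the "Can not merge type" error when Spark tries to infer a single type for it.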

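There are a few ways to deal with this:

You can skip the Spark DataFrame stage: convert the pandas DataFrame to dicts (df.to_dict(orient='records')), read them into an RDD, and save that (consider using json.loads / json.dumps to turn the records into proper JSON).
Convert the object columns to strings (df[col] = df[col].astype(str)).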
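The first option could be sketched roughly as follows. The sample frame is hypothetical, and the Spark step is left as a comment because it assumes the sparkSession and hdfsPath objects from the question already exist; it has not been run here.

```python
import json

import pandas as pd

# Hypothetical sample with a mixed-type "value" column
data = pd.DataFrame({
    "name": pd.Series(["a", "b"], dtype=object),
    "value": pd.Series(["abc", 1.5], dtype=object),
})

# One dict per row, then one JSON document per line
records = data.to_dict(orient="records")
json_lines = [json.dumps(rec) for rec in records]

for line in json_lines:
    print(line)

# Assumed Spark step (names taken from the question, not verified here):
# sparkSession.sparkContext.parallelize(json_lines).saveAsTextFile(hdfsPath)
```

Because each row is serialized on its own, no single column type has to be inferred, so mixed columns are not a problem. Note that json.dumps writes missing values as NaN, which is not strict JSON, so gaps may need to be replaced first.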

Which one to choose depends on what you want.
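If keeping the Spark DataFrame is preferred, the second option might look like the sketch below. The column names and values are invented for illustration; only columns with the object dtype are converted, so numeric columns keep their type.

```python
import pandas as pd

# Hypothetical sample: "count" is purely numeric, "value" mixes str, float and int
data = pd.DataFrame({
    "count": pd.Series([1, 2, 3]),
    "value": pd.Series(["abc", 1.5, 2], dtype=object),
})

# Convert every object column to strings so each column holds a single type
for col in data.columns:
    if data[col].dtype == object:
        data[col] = data[col].astype(str)

types_after = {type(v).__name__ for v in data["value"]}
print(types_after)
```

The trade-off is that numbers in those columns become text. Depending on the pandas version, missing values may also turn into the literal string 'nan'.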

For this case, use data.fillna('0', inplace=True), because the columns have empty records.
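A small sketch of why filling the gaps helps, using a made-up frame: an empty cell in a text column is a float NaN, so the column mixes str and float until it is filled. This variant fills only the object columns, which is an assumption beyond the one-line call above. Filling a numeric column with the string '0' would make that column mixed instead.

```python
import pandas as pd


def value_types(series):
    """Sorted names of the Python types found in a column."""
    return sorted({type(v).__name__ for v in series})


# Hypothetical sample: "name" has an empty cell, "score" is fully numeric
data = pd.DataFrame({
    "name": pd.Series(["a", float("nan"), "c"], dtype=object),
    "score": pd.Series([1.5, 2.0, 3.5]),
})

before = value_types(data["name"])

# Fill gaps with a string placeholder, but only in object (text) columns
for col in data.columns:
    if data[col].dtype == object:
        data[col] = data[col].fillna("0")

after = value_types(data["name"])
print(before, after)
```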
