Since there is no out-of-the-box support for reading Excel files in Spark, I first read the Excel file into a pandas DataFrame and then tried to convert the pandas DataFrame into a Spark DataFrame, but I got the following error (I am using Spark 1.5.1):
import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
pdf=pd.read_excel('/home/testdata/test.xlsx')
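# sqlContext is the SQLContext instance the pyspark shell provides
# automatically (the traceback below shows this runs from <stdin>)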
df = sqlContext.createDataFrame(pdf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
    rdd, schema = self._createFromLocal(data, schema)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 337, in _createFromLocal
    data = [schema.toInternal(row) for row in data]
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in toInternal
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in <genexpr>
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 435, in toInternal
    return self.dataType.toInternal(obj)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 191, in toInternal
    else time.mktime(dt.timetuple()))
AttributeError: 'datetime.time' object has no attribute 'timetuple'
Does anyone know how to fix this?
Answer 0 (score: 1)
My best guess is that your problem comes from the datetime data being parsed "incorrectly" when you read it with pandas.
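One quick way to check this is to look at the dtypes pandas inferred: a timestamp column that came back as bare datetime.time objects (which Spark's TimestampType cannot serialize, per the traceback) shows up as a plain object column instead of datetime64[ns]. A minimal diagnostic sketch, using the 'Created on' column name from below:

# columns parsed as real datetimes are reported as 'datetime64[ns]';
# a column holding bare datetime.time objects is reported as 'object'
print(pdf.dtypes)
print(type(pdf['Created on'].iloc[0]))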
The following code works "fine":
import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
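# ask pandas to parse these columns as full datetimes up front, so the
# frame does not end up holding bare datetime.time objects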
pdf = pd.read_excel('test.xlsx', parse_dates=['Created on','Confirmation time'])
sc = SparkContext()
sqlContext = SQLContext(sc)
sqlContext.createDataFrame(data=pdf).collect()
[Row(Customer=1000935702, Country='TW', ...
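If parse_dates is not convenient (for example when you do not know the column names up front), another workaround is to cast any column that still holds datetime.time objects to strings before handing the frame to Spark, so that createDataFrame maps them to StringType instead of failing. A sketch; the detection loop that samples the first non-null value is illustrative, not exhaustive:

import datetime

pdf = pd.read_excel('test.xlsx')
# cast columns of bare datetime.time objects to strings so Spark
# stores them as StringType rather than raising on TimestampType
for col in pdf.columns:
    sample = pdf[col].dropna()
    if len(sample) > 0 and isinstance(sample.iloc[0], datetime.time):
        pdf[col] = pdf[col].astype(str)
df = sqlContext.createDataFrame(pdf)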
Note that your other datetime column, 'Confirmation date', consists entirely of NaT in your sample, so reading it into an RDD from this short sample is no problem; but if you happen to have actual data in that column somewhere in the full dataset, you will have to take care of it as well.
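For that column, a defensive cast before the conversion could look like this (a sketch; errors='coerce' turns anything pandas cannot parse into NaT, which your NaT-only sample shows Spark already accepts):

# force 'Confirmation date' to datetime64 so the full dataset converts
# the same way the NaT-only sample does; unparseable cells become NaT
pdf['Confirmation date'] = pd.to_datetime(pdf['Confirmation date'], errors='coerce')
sqlContext.createDataFrame(pdf).collect()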