pyspark createdataframe: string interpreted as timestamp, schema mixes up columns

Date: 2017-02-03 14:05:05

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I have a really strange error with Spark DataFrames that causes a string to be evaluated as a timestamp.

Here is my setup code:

from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

new_schema = StructType([StructField("item_id", StringType(), True),
                         StructField("date", TimestampType(), True),
                         StructField("description", StringType(), True)
                        ])

df = sqlContext.createDataFrame([Row(description='description', date=datetime.utcnow(), item_id='id_string')], new_schema)

This gives me the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 df = sqlContext.createDataFrame([Row(description='hey', date=datetime.utcnow(), item_id='id_string')], new_schema)

/home/florian/spark/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    307         Py4JJavaError: ...
    308         """
--> 309         return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
    310
    311     @since(1.3)

/home/florian/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    522             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    523         else:
--> 524             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    525         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    526         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/home/florian/spark/python/pyspark/sql/session.pyc in _createFromLocal(self, data, schema)
    397
    398         # convert python objects to sql data
--> 399         data = [schema.toInternal(row) for row in data]
    400         return self._sc.parallelize(data), schema
    401

/home/florian/spark/python/pyspark/sql/types.pyc in toInternal(self, obj)
    574             return tuple(f.toInternal(obj.get(n)) for n, f in zip(self.names, self.fields))
    575         elif isinstance(obj, (tuple, list)):
--> 576             return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
    577         elif hasattr(obj, "__dict__"):
    578             d = obj.__dict__

/home/florian/spark/python/pyspark/sql/types.pyc in <genexpr>((f, v))
    574             return tuple(f.toInternal(obj.get(n)) for n, f in zip(self.names, self.fields))
    575         elif isinstance(obj, (tuple, list)):
--> 576             return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
    577         elif hasattr(obj, "__dict__"):
    578             d = obj.__dict__

/home/florian/spark/python/pyspark/sql/types.pyc in toInternal(self, obj)
    434
    435     def toInternal(self, obj):
--> 436         return self.dataType.toInternal(obj)
    437
    438     def fromInternal(self, obj):

/home/florian/spark/python/pyspark/sql/types.pyc in toInternal(self, dt)
    188     def toInternal(self, dt):
    189         if dt is not None:
--> 190             seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    191                        else time.mktime(dt.timetuple()))
    192             return int(seconds * 1e6 + dt.microsecond)

AttributeError: 'str' object has no attribute 'tzinfo'

It looks as though a string is being passed to TimestampType.toInternal().
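You can reproduce the failing call in isolation (a minimal sketch; toInternal() is an internal PySpark API, shown here only to illustrate the error path):

from pyspark.sql.types import TimestampType

# handing a string straight to the timestamp converter fails the same way,
# because the converter immediately asks for dt.tzinfo
TimestampType().toInternal('id_string')
# AttributeError: 'str' object has no attribute 'tzinfo'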

What is really strange is that this DataFrame produces the same error:

df = sqlContext.createDataFrame([Row(description='hey', date=None, item_id='id_string')], new_schema)

while this one works:

df = sqlContext.createDataFrame([Row(description=None, date=datetime.now(), item_id='id_string')], new_schema)

and this one works as well:

df = sqlContext.createDataFrame([Row(description=None, date=datetime.now(), item_id=None)], new_schema)

To me, this means that pyspark somehow puts the value from "item_id" into the column "date" and therefore produces this error. Am I doing something wrong? Is this a bug in DataFrames?

Info: I am using pyspark 2.0.1

Edit:

df = sqlContext.createDataFrame([Row(description=None, date=datetime.now(), item_id=None)], new_schema)
df.first()

Row(item_id=u'java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=1,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=3,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=1,HOUR=3,HOUR_OF_DAY=15,MINUTE=19,SECOND=30,MILLISECOND=85,ZONE_OFFSET=?,DST_OFFSET=?]', date=None, description=None)

2 Answers:

Answer 0 (score: 6):

When you create a Row object, the fields are sorted alphabetically (http://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.Row), so when you create the Row(description, date, item_id) object, it will be ordered as (date, description, item_id).
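A quick way to see this (a small sketch you can run in any PySpark 2.x shell) is to inspect a Row built with keyword arguments:

from datetime import datetime
from pyspark.sql import Row

# keyword arguments are stored in alphabetical field order, not call order
r = Row(description='description', date=datetime.utcnow(), item_id='id_string')
print(r.__fields__)   # ['date', 'description', 'item_id']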

Since your schema is ordered as StringType, TimestampType, StringType, when creating a DataFrame with this Row and schema, Spark will map what is in date to a StringType, what is in description to a TimestampType, and what is in item_id to a StringType.

Passing a timestamp (in datetime format) to a StringType does not cause an error, but passing a string to a TimestampType does, because it asks for the tzinfo attribute, which, as the error states, a str object does not have.

Also, the reason the DataFrames that worked for you actually worked is that None was passed to the TimestampType in your schema, which is an acceptable value.
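One way around this (a sketch, keeping the schema order from the question) is to skip Row altogether and pass plain tuples, which are matched to the schema by position rather than by field name:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

new_schema = StructType([StructField("item_id", StringType(), True),
                         StructField("date", TimestampType(), True),
                         StructField("description", StringType(), True)])

# tuples are zipped against the schema positionally, so no alphabetical
# reordering can creep in
df = sqlContext.createDataFrame([('id_string', datetime.utcnow(), 'description')],
                                new_schema)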

Answer 1 (score: 2):

Building on @rafael-zanetti's answer above, you can do the following to sort your columns:

new_schema = [StructField("item_id", StringType(), True),
              StructField("date", TimestampType(), True),
              StructField("description", StringType(), True)]
new_schema = StructType(sorted(new_schema, key=lambda f: f.name))
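With the schema fields now in alphabetical order, matching the order Row uses internally, the original call from the question should go through (assuming the same sqlContext as in the question):

df = sqlContext.createDataFrame([Row(description='description',
                                     date=datetime.utcnow(),
                                     item_id='id_string')], new_schema)
df.printSchema()
# root
#  |-- date: timestamp (nullable = true)
#  |-- description: string (nullable = true)
#  |-- item_id: string (nullable = true)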