Question

I'm getting a very strange error with Spark DataFrames that causes a string to be evaluated as a timestamp.

Here is my setup code:

from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

new_schema = StructType([StructField("item_id", StringType(), True),
                         StructField("date", TimestampType(), True),
                         StructField("description", StringType(), True)
                        ])

df = sqlContext.createDataFrame([Row(description='description', date=datetime.utcnow(), item_id='id_string')], new_schema)

This gives me the following error:

AttributeError                            Traceback (most recent call last)
 in ()
----> 1 df = sqlContext.createDataFrame([Row(description='hey', date=datetime.utcnow(), item_id='id_string')], new_schema)

/home/florian/spark/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    307         Py4JJavaError: ...
    308         """
--> 309         return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
    310
    311     @since(1.3)

/home/florian/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    522             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    523         else:
--> 524             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    525         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    526         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/home/florian/spark/python/pyspark/sql/session.pyc in _createFromLocal(self, data, schema)
    397
    398         # convert python objects to sql data
--> 399         data = [schema.toInternal(row) for row in data]
    400         return self._sc.parallelize(data), schema
    401

/home/florian/spark/python/pyspark/sql/types.pyc in toInternal(self, obj)
    574                 return tuple(f.toInternal(obj.get(n)) for n, f in zip(self.names, self.fields))
    575             elif isinstance(obj, (tuple, list)):
--> 576                 return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
    577             elif hasattr(obj, "__dict__"):
    578                 d = obj.__dict__

/home/florian/spark/python/pyspark/sql/types.pyc in <genexpr>((f, v))
    574                 return tuple(f.toInternal(obj.get(n)) for n, f in zip(self.names, self.fields))
    575             elif isinstance(obj, (tuple, list)):
--> 576                 return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
    577             elif hasattr(obj, "__dict__"):
    578                 d = obj.__dict__

/home/florian/spark/python/pyspark/sql/types.pyc in toInternal(self, obj)
    434
    435     def toInternal(self, obj):
--> 436         return self.dataType.toInternal(obj)
    437
    438     def fromInternal(self, obj):

/home/florian/spark/python/pyspark/sql/types.pyc in toInternal(self, dt)
    188     def toInternal(self, dt):
    189         if dt is not None:
--> 190             seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    191                        else time.mktime(dt.timetuple()))
    192             return int(seconds * 1e6 + dt.microsecond)

AttributeError: 'str' object has no attribute 'tzinfo'

It looks as if a string is being passed to TimestampType.toInternal().

What is really strange is that this DataFrame produces the same error:

df = sqlContext.createDataFrame([Row(description='hey', date=None, item_id='id_string')], new_schema)

While this one works:

df = sqlContext.createDataFrame([Row(description=None, date=datetime.now(), item_id='id_string')], new_schema)

This one works as well:

df = sqlContext.createDataFrame([Row(description=None, date=datetime.now(), item_id=None)], new_schema)

To me, this means that pyspark is somehow putting the value from "item_id" into the "date" column, which produces this error. Am I doing something wrong? Is this a bug in DataFrames?

Info: I am using pyspark 2.0.1

Edit:

df = sqlContext.createDataFrame([Row(description=None, date=datetime.now(), item_id=None)], new_schema)
df.first()

Row(item_id=u'java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=1,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=3,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=1,HOUR=3,HOUR_OF_DAY=15,MINUTE=19,SECOND=30,MILLISECOND=85,ZONE_OFFSET=?,DST_OFFSET=?]', date=None, description=None)

Answer 1

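When you create a Row object from keyword arguments, its fields are sorted alphabetically (http://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.Row), so a Row created as Row(description, date, item_id) is actually ordered as (date, description, item_id).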

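Since your schema is ordered as StringType, TimestampType, StringType, Spark pairs the Row's values with the schema by position when it builds the DataFrame: the content of date is mapped to a StringType, the content of description to the TimestampType, and the content of item_id to a StringType.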
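The misalignment can be reproduced without Spark. The sketch below is a plain-Python simulation, assuming the behaviour described above (keyword fields sorted by name, then paired with the schema by position). `simulate_row` and `pair_with_schema` are hypothetical helpers written for illustration, not PySpark functions.

```python
from datetime import datetime

# Schema order from the question: (field name, type name).
SCHEMA = [("item_id", "string"), ("date", "timestamp"), ("description", "string")]

def simulate_row(**kwargs):
    """Mimic the described Row behaviour: values ordered by sorted field name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)

def pair_with_schema(values, schema=SCHEMA):
    """Pair values with schema fields by position, ignoring the Row's own names."""
    return [(name, type_name, value)
            for (name, type_name), value in zip(schema, values)]

names, values = simulate_row(description="hey",
                             date=datetime(2017, 1, 3),
                             item_id="id_string")
print(names)  # ['date', 'description', 'item_id']

# The "date" timestamp field ends up paired with the string 'hey'.
for field, type_name, value in pair_with_schema(values):
    print(field, type_name, repr(value))
```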

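Passing a timestamp (as a datetime) to a StringType does not cause an error, but passing a string to a TimestampType does, because the conversion reads the tzinfo attribute, which, as the error states, a str object does not have.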

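In addition, the DataFrames that did work for you only worked because None was passed to the TimestampType in your schema, and None is an acceptable value.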
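Both cases can be checked in isolation. The function below is a standalone copy of the conversion lines shown in the question's traceback (types.pyc, lines 188 to 192); the name `timestamp_to_internal` is made up here, and the code is a sketch of that one step rather than the full PySpark class.

```python
import calendar
import time
from datetime import datetime

def timestamp_to_internal(dt):
    """Convert a datetime to microseconds since the epoch, as in the traceback."""
    if dt is not None:
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                   else time.mktime(dt.timetuple()))
        return int(seconds * 1e6 + dt.microsecond)
    # None skips the branch above and is returned as None (a null timestamp).

print(timestamp_to_internal(None))                  # None
print(timestamp_to_internal(datetime(2017, 1, 3)))  # an int, depends on the local time zone

try:
    timestamp_to_internal("hey")  # a str reaches the timestamp conversion
except AttributeError as exc:
    print(exc)  # 'str' object has no attribute 'tzinfo'
```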

Answer 2

Building on the answer above by @rafael-zanetti: you can do the following to sort the schema's columns so that they match the Row's order:

new_schema = [StructField("item_id", StringType(), True),
                     StructField("date", TimestampType(), True),
                     StructField("description", StringType(), True)]
new_schema = StructType(sorted(new_schema, key=lambda f: f.name))
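An alternative sketch, not taken from either answer: leave the schema in its declared order and build plain tuples in that same order, which avoids relying on the Row's field ordering at all. `to_schema_order` is a hypothetical helper, and the field names are the ones from the question's schema.

```python
from datetime import datetime

# Field names in the order declared by the question's schema.
FIELD_NAMES = ["item_id", "date", "description"]

def to_schema_order(record, field_names=FIELD_NAMES):
    """Return the record's values as a tuple ordered like the schema.

    Keys missing from the record become None.
    """
    return tuple(record.get(name) for name in field_names)

record = {"description": "hey",
          "date": datetime(2017, 1, 3),
          "item_id": "id_string"}
row = to_schema_order(record)
print(row)  # ('id_string', datetime.datetime(2017, 1, 3, 0, 0), 'hey')
```

The resulting tuples could then be passed in place of the Row objects, for example sqlContext.createDataFrame([row], new_schema), since createDataFrame also accepts a list of tuples together with an explicit schema.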

pyspark createDataFrame: string interpreted as timestamp, schema mixes up columns
