Why can't PySpark create a single-column DataFrame?

Asked: 2019-03-20 07:04:34

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

When I create a new DataFrame with PySpark like this:

>>> l = [('a',1),('b',2)]
>>> spark.createDataFrame(l)

it works, but when I create the DataFrame like this:

>>> l = [(1),(2)]
>>> spark.createDataFrame(l)

I get an error message. Even if I specify a schema, the result is still the same:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tianlh/spark/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/home/tianlh/spark/python/pyspark/sql/session.py", line 390, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "/home/tianlh/spark/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "/home/tianlh/spark/python/pyspark/sql/types.py", line 992, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
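
The traceback points at Python syntax rather than Spark: (1) is not a one-element tuple, it is just the integer 1 in parentheses, so l = [(1),(2)] is a plain list of ints with no row structure from which a schema could be inferred. A quick check in a bare Python 2 shell (matching the <type 'int'> in the traceback) makes this visible:

>>> type((1))
<type 'int'>
>>> type((1,))
<type 'tuple'>
>>> type(('a',1))
<type 'tuple'>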

This, however, works:

>>> from pyspark.sql import types as T
>>> schema = T.StructType([T.StructField('task_id', T.StringType(), True),
...                        T.StructField('task_id', T.IntegerType(), True)])
>>> l = [('a',1),('b',2)]
>>> spark.createDataFrame(l, schema=schema).show()
+-------+-------+
|task_id|task_id|
+-------+-------+
|      a|      1|
|      b|      2|
+-------+-------+
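
The same pattern should carry over to a single column, provided each value is wrapped in a one-element tuple (note the trailing comma) so that it still forms a row. A minimal sketch, assuming the same spark session and the T alias from above, and reusing the task_id name:

>>> schema = T.StructType([T.StructField('task_id', T.IntegerType(), True)])
>>> l = [(1,),(2,)]
>>> spark.createDataFrame(l, schema=schema).show()
+-------+
|task_id|
+-------+
|      1|
|      2|
+-------+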

My question is: why does this happen? What is the difference between creating two columns and creating a single column? Why does creating just a single column not work?
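
In short, the difference is not one column versus two but tuple versus scalar: ('a',1) is a tuple, which Spark can treat as a row, while (1) is a bare int, which _infer_schema rejects. Two hedged sketches of the usual fixes, again assuming the same spark session and T alias; the atomic-type form should work on Spark 2.x and later, but exact support may vary by version:

>>> spark.createDataFrame([(1,),(2,)]).show()              # trailing comma: inference works, column is named _1
>>> spark.createDataFrame([1, 2], T.IntegerType()).show()  # atomic DataType as schema: bare ints are accepted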

0 Answers:

No answers yet.