When I create a DataFrame with PySpark like this:
>>> l = [('a',1),('b',2)]
>>> spark.createDataFrame(l)
it works, but when I create the DataFrame like this:
>>> l = [(1),(2)]
>>> spark.createDataFrame(l)
I get an error message:
Even if I specify the schema, the result is the same:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/tianlh/spark/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/home/tianlh/spark/python/pyspark/sql/session.py", line 390, in _createFromLocal
struct = self._inferSchemaFromList(data)
File "/home/tianlh/spark/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
schema = reduce(_merge_type, map(_infer_schema, data))
File "/home/tianlh/spark/python/pyspark/sql/types.py", line 992, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
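One detail worth noting (a plain-Python observation, not specific to Spark): `(1)` is just the integer `1`, because a one-element tuple requires a trailing comma. So `[(1),(2)]` is a list of ints rather than a list of one-field rows, which would explain why schema inference sees `<type 'int'>`:

```python
# In plain Python, wrapping a single value in parentheses does not
# create a tuple; a one-element tuple needs a trailing comma.
l = [(1), (2)]
print(type(l[0]).__name__)   # 'int' -- (1) is just the integer 1
print(l == [1, 2])           # True

l2 = [(1,), (2,)]            # trailing commas: one-element tuples
print(type(l2[0]).__name__)  # 'tuple'
```

If that is the cause, writing the list as `[(1,), (2,)]` should let `createDataFrame` treat each element as a one-column row, though I have not verified that here.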
While this works:
>>> schema = T.StructType([T.StructField('task_id',T.StringType(),True),T.StructField('task_id',T.IntegerType(),True)])
>>> l = [('a',1),('b',2)]
>>> spark.createDataFrame(l,schema=schema).show()
+-------+-------+
|task_id|task_id|
+-------+-------+
| a| 1|
| b| 2|
+-------+-------+
My question is: why does this happen? What is the difference between creating two columns and creating one? Why does creating only a single column fail?