Getting TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))

Time: 2018-01-23 07:45:16

Tags: python apache-spark dataframe pyspark pyspark-sql

I am creating a Spark session (Spark version 2.2.1) as follows:

from pyspark.sql import SparkSession

SparkS = SparkSession.builder\
    .appName("Test")\
    .master("local[*]")\
    .getOrCreate()

Then I read the raw data into an RDD through the SparkContext like below:

raw_data = SparkS\
    .sparkContext\
    .textFile("C:\\Users\\...\\RawData\\nasdaq.csv")

For verification purposes, I print the data using:

print(raw_data.take(3))

and the output is

['43084,6871.549805,6945.819824,6871.450195,6936.580078,6936.580078,3510420000', '43087,6980.399902,7003.890137,6975.540039,6994.759766,6994.759766,2144360000', '43088,6991.25,6995.879883,6951.490234,6963.850098,6963.850098,2071060000']

Now I convert the RDD to a DataFrame by defining a schema as follows:

from pyspark.sql.types import StructType, StringType

schema = StructType().add("date", StringType())\
                     .add("open", StringType())\
                     .add("high", StringType())\
                     .add("low", StringType())\
                     .add("close", StringType())\
                     .add("adj_close", StringType())\
                     .add("volume", StringType())

geioIP = SparkS.createDataFrame(raw_data,schema)
print(geioIP)

The output is:

DataFrame[date: string, open: string, high: string, low: string, close: string, adj_close: string, volume: string]

So far so good, but the problem is that when I call geioIP.show(2), it gives me an error:

18/01/23 12:58:48 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
  File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
  File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "C:\Users\rajnish.kumar\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\session.py", line 520, in prepare
    verify_func(obj, schema)
  File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1371, in _verify_type
    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object '43084,6871.549805,6945.819824,6871.450195,6936.580078,6936.580078,3510420000' in type <class 'str'>

After going through this link, what I did was convert all the csv data to text format, but I am still facing the issue.

3 Answers:

Answer 0 (score: 2):

The problem is that each row in the RDD is a single string (i.e. one column), while your schema contains 7 columns. The RDD is not actually converted to a dataframe until you apply an action (such as show), which is why it does not crash immediately.
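
You can verify this yourself before calling createDataFrame; a minimal check (a sketch, reusing the raw_data RDD from the question):

first = raw_data.first()
print(type(first))        # <class 'str'> -- each row is one string, not 7 fields
print(first.split(","))   # splitting yields the 7 values the schema expects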

Since you want the data in a dataframe, the simplest solution is to read the data as a dataframe from the start:

geioIP = SparkS.read.csv("C:\\Users\\...\\RawData\\nasdaq.csv", schema=schema)
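
If you would rather get numeric columns instead of all strings, a variation (a sketch, assuming the CSV has no header row) is to let Spark infer the types and then rename the generated columns:

geioIP = SparkS.read.csv("C:\\Users\\...\\RawData\\nasdaq.csv", inferSchema=True)
# Inferred columns arrive as _c0.._c6, so rename them to the intended names
geioIP = geioIP.toDF("date", "open", "high", "low", "close", "adj_close", "volume")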

Or, if you want to keep using the RDD and createDataFrame, you can use the split function (and strip if there is whitespace):

raw_data = raw_data.map(lambda x: [c.strip() for c in x.split(',')])
geioIP = SparkS.createDataFrame(raw_data,schema)
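
After the map, each element of raw_data is a list of 7 strings, so it matches the 7-column schema; a quick sanity check (a sketch):

print(raw_data.first())   # ['43084', '6871.549805', ..., '3510420000']
geioIP.show(2)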

Answer 1 (score: 1):

Hi all, thanks to @Shaido for pointing out the most basic thing about RDDs: "each row in the RDD is a single string (i.e. one column), while your schema contains 7 columns." With the help of that post I was able to solve the above issue.

Before using raw_data directly in
geioIP = SparkS.createDataFrame(raw_data,schema)

I needed to split it into an RDD of lists, which I did like this:

rawdata = raw_data.map(lambda x : x.split(","))

Now calling

geioIP = SparkS.createDataFrame(rawdata,schema)
geioIP.show(2)

yields

+-----+-----------+-----------+-----------+-----------+-----------+----------+
| date|       open|       high|        low|      close|  adj_close|    volume|
+-----+-----------+-----------+-----------+-----------+-----------+----------+
|43084|6871.549805|6945.819824|6871.450195|6936.580078|6936.580078|3510420000|
|43087|6980.399902|7003.890137|6975.540039|6994.759766|6994.759766|2144360000|
+-----+-----------+-----------+-----------+-----------+-----------+----------+
only showing top 2 rows
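
Note that because the schema declares every field as StringType, the values in this DataFrame are still strings; a sketch of casting a couple of columns afterwards (the column choices here are just examples):

from pyspark.sql.functions import col

geioIP = geioIP.withColumn("open", col("open").cast("double")) \
               .withColumn("volume", col("volume").cast("long"))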

Answer 2 (score: -1):

Convert the raw data to this: [Row(x) for x in raw_data]
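
On its own that still leaves each line as a single value; a fuller sketch of a Row-based conversion (hypothetical, not part of the original answer) splits each line first so the Row has 7 fields:

from pyspark.sql import Row

rows = raw_data.map(lambda x: Row(*x.split(",")))   # one 7-field Row per line
geioIP = SparkS.createDataFrame(rows, schema)
geioIP.show(2)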