Question

I am new to Spark. I am following a PySpark video course and trying to convert a JSON string to a DataFrame using the code below.

    import pyspark as ps
    from pyspark.sql import HiveContext # to interface dataframe API

    sc = ps.SparkContext()
    hive_context = HiveContext(sc)

    # some code ....  and build meals_json
    meals_json.take(1) # below is output of this code
    #['{"meal_id": 1, "dt": "2013-01-01", "type": "french", "price": 10}']
    # some more code

    meals_dataframe = hive_context.jsonRDD(meals_json)
    meals_dataframe.first()

When I run the last lines, I get the error below.

    AttributeError     Traceback (most recent call last)
    <ipython-input-19-43e4f3006ac3> in <module>()
    ----> 1 meals_dataframe = hive_context.jsonRDD(meals_json)
          2 meals_dataframe.first()

    AttributeError: 'HiveContext' object has no attribute 'jsonRDD'

I searched the web but could not find any resource discussing this issue. I am running this code with Spark 2.1.1 in a Jupyter notebook using Python 3.5.

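From the documentation, I can see that jsonRDD is inherited from the class org.apache.spark.sql.SQLContext. I am not sure what the cause might be. Any suggestions would be helpful. Thanks.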

Answer 1

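sqlContext.jsonRDD is deprecated. As of 1.4.0 it has been replaced by read().json() (in PySpark, read is a property, so the call is written read.json()). The method is not present on the context object in Spark 2.1.1, which matches the AttributeError in the question. Below is an example that works on Spark 2.1.1: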

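    import json
    from pyspark.sql.types import StructField, StructType, StringType

    # Assumes sc (SparkContext) and sqlContext already exist, as in the pyspark shell.
    # With the question's setup, hive_context can be used in place of sqlContext.

    # Two sample records, serialized to JSON strings
    r = [{'a': 'aaa', 'b': 'bbb', 'c': 'ccc'},
         {'a': 'aaaa', 'b': 'bbbb', 'c': 'cccc', 'd': 'dddd'}]
    r = [json.dumps(d) for d in r]

    # Known schema: every column is a nullable string
    field_names = ['a', 'b', 'c', 'd']
    fields = [StructField(field_name, StringType(), True) for field_name in field_names]
    schema = StructType(fields)

    rdd = sc.parallelize(r)
    df = sqlContext.read.schema(schema).json(rdd)
    df.collect()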

This gives the following output on Spark 2.1.1 (the u'' prefixes appear under Python 2; Python 3 prints plain strings):

    [Row(a=u'aaa', b=u'bbb', c=u'ccc', d=None),
     Row(a=u'aaaa', b=u'bbbb', c=u'cccc', d=u'dddd')]

Note that the first part of this snippet was inspired by this question on the Apache Spark user list.
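
Applied to the question's own data, the same replacement might look like the sketch below. It is a minimal illustration, not a definitive fix: the helper name json_lines_to_dataframe is hypothetical, the sample record is the one printed by meals_json.take(1), and the sc and sql_context arguments are assumed to be an existing SparkContext and a SQLContext or HiveContext (such as the question's hive_context). Here the schema is left for Spark to infer rather than supplied explicitly.

```python
import json

# The single record shown in the question's meals_json.take(1) output.
meals = [{"meal_id": 1, "dt": "2013-01-01", "type": "french", "price": 10}]
meal_lines = [json.dumps(m) for m in meals]  # one JSON document per string


def json_lines_to_dataframe(sc, sql_context, lines):
    """Hypothetical helper: build a DataFrame from a list of JSON strings.

    sc          -- an existing SparkContext (assumed to be created elsewhere)
    sql_context -- a SQLContext or HiveContext, e.g. the question's hive_context
    lines       -- list of strings, each holding one JSON object
    """
    rdd = sc.parallelize(lines)        # RDD of JSON strings, like meals_json
    return sql_context.read.json(rdd)  # used here instead of jsonRDD(rdd)
```

With the question's setup this would be called as meals_dataframe = json_lines_to_dataframe(sc, hive_context, meal_lines), followed by meals_dataframe.first(). If meals_json is already an RDD of strings, the helper is not needed and hive_context.read.json(meals_json) should be enough.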
