Error importing SQLContext and parallelizing in PySpark

Date: 2018-03-19 11:29:43

Tags: apache-spark dataframe pyspark rdd

line = "Hello, world"
sc.parallelize(list(line)).collect()

I get the following error:

TypeError: parallelize() missing 1 required positional argument: 'c'
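This error pattern appears when a class itself, rather than an instance, is called: invoking a method through the class consumes the first argument as `self`, leaving the real parameter missing. A minimal pure-Python sketch (the `Demo` class is hypothetical, standing in for `SparkContext`):

```python
class Demo:
    def method(self, c):
        return c

# Referencing the class attribute, like writing sc = SparkContext without ():
m = Demo.method

try:
    m("x")  # "x" is bound to self, so c is missing
except TypeError as e:
    # Message contains: missing 1 required positional argument: 'c'
    print(e)
```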

I also have another problem when creating a DataFrame from a list of strings with just one column:

from pyspark.sql.types import *
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
schema = StructType([StructField("name", StringType(), True)])
df3 = sqlContext.createDataFrame(fuzzymatchIntro, schema)
df3.printSchema()

I get the following error:

----> 3 sqlContext = SQLContext(sc)
AttributeError: type object 'SparkContext' has no attribute '_jsc'

Thanks in advance.

1 answer:

Answer 0 (score: 0)

Looking at the comments above, it seems you have initialized the SparkContext the wrong way:


from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext  # bug: assigns the class itself, not an instance
spark = SparkSession.builder.appName("DFTest").getOrCreate()

The correct way is:

from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("DFTest").getOrCreate()
sc = spark.sparkContext

The spark session object can do the work of sqlContext: in Spark 2.x, SparkSession subsumes SQLContext, so you can call spark.createDataFrame directly instead of creating a separate SQLContext.