Unable to create a DataFrame from a JSON DStream using pyspark

Asked: 2016-07-27 11:06:16

Tags: python json apache-spark pyspark dstream

I am trying to create a DataFrame from JSON in a DStream, but the code below does not seem to produce the DataFrame correctly:

import sys
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise IOError("Invalid usage; the correct format is:\nquadrant_count.py <hostname> <port>")

# Initialize a SparkContext with a name
spc = SparkContext(appName="jsonread")
sqlContext = SQLContext(spc)
# Create a StreamingContext with a batch interval of 2 seconds
stc = StreamingContext(spc, 2)
# Checkpointing feature
stc.checkpoint("checkpoint")
# Creating a DStream to connect to hostname:port (like localhost:9999)
lines = stc.socketTextStream(sys.argv[1], int(sys.argv[2]))
lines.pprint()
parsed = lines.map(lambda x: json.loads(x))
def process(time, rdd):
    print("========= %s =========" % str(time))
    try:
        # Get the singleton instance of SQLContext
        sqlContext = getSqlContextInstance(rdd.context)
        # Convert RDD[String] to RDD[Row] to DataFrame
        rowRdd = rdd.map(lambda w: Row(word=w))
        wordsDataFrame = sqlContext.createDataFrame(rowRdd)
        # Register as table
        wordsDataFrame.registerTempTable("mytable")
        testDataFrame = sqlContext.sql("select summary from mytable")
        print(testDataFrame.show())
        print(testDataFrame.printSchema())
    except:
        pass
parsed.foreachRDD(process)
stc.start()
# Wait for the computation to terminate
stc.awaitTermination()

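There are no errors. When the script runs, it does read the JSON from the streaming context successfully, but it prints neither the values of summary nor the DataFrame schema.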
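One possible reason for the silence is the bare `except: pass` in `process`: assuming the posted imports are complete, `Row` is never imported, so the resulting `NameError` would be swallowed along with any other failure. The following is a plain-Python stand-in for that behaviour, not Spark code; `process_silent` and `process_verbose` are hypothetical names used only for illustration, and a list of dicts stands in for the RDD.

```python
import traceback


def process_silent(time, records):
    """Mimics the posted handler: every failure is swallowed."""
    print("========= %s =========" % str(time))
    try:
        # Row is deliberately NOT imported here, as in the posted code.
        rows = [Row(word=r) for r in records]
        print(rows)
    except:
        pass  # the NameError disappears without a trace


def process_verbose(time, records):
    """Same body, but the failure is reported instead of hidden."""
    print("========= %s =========" % str(time))
    try:
        rows = [Row(word=r) for r in records]
        print(rows)
        return None
    except Exception as exc:
        traceback.print_exc()  # shows the NameError on stderr
        return exc
```

Calling both with one record prints only the time header in the first case, which matches the symptom described above, while the second also prints the traceback.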

The sample JSON I am trying to read:

{"reviewerID": "A2IBPI20UZIR0U", "asin": "1384719342", "reviewerName": "cassandra tu \"Yeah, well, that's just like, u...", "helpful": [0, 0], "reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,", "overall": 5.0, "summary": "good", "unixReviewTime": 1393545600, "reviewTime": "02 28, 2014"}
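For a record shaped like this, a minimal sketch of a handler that turns each batch into a DataFrame might look like the following. It assumes a Spark 1.6-era API (`SQLContext.getOrCreate`, `registerTempTable`) and that every record carries the same fields; `parse_json_line` is a hypothetical helper, not part of PySpark. Each JSON object is expanded into a `Row` with one column per key, so `summary` exists as a column, and errors are printed rather than discarded.

```python
import json
import traceback


def parse_json_line(line):
    """Return the decoded JSON object as a dict, or None for malformed input."""
    try:
        record = json.loads(line)
    except ValueError:
        return None
    return record if isinstance(record, dict) else None


def process(time, rdd):
    """Handler for DStream.foreachRDD: one batch of raw JSON lines."""
    # Imported lazily so parse_json_line can be tried without a Spark install.
    from pyspark.sql import Row, SQLContext

    print("========= %s =========" % str(time))
    try:
        records = rdd.map(parse_json_line).filter(lambda r: r is not None)
        if records.isEmpty():
            return  # nothing usable arrived in this batch
        sqlContext = SQLContext.getOrCreate(rdd.context)
        # One column per JSON key (assumes all records share the same keys).
        rowRdd = records.map(lambda r: Row(**r))
        df = sqlContext.createDataFrame(rowRdd)
        df.registerTempTable("mytable")
        # show() and printSchema() print by themselves and return None.
        sqlContext.sql("select summary from mytable").show()
        df.printSchema()
    except Exception:
        traceback.print_exc()  # surface errors instead of hiding them
```

Because this handler parses the lines itself, it would be attached to the raw stream, as in `lines.foreachRDD(process)`, rather than to `parsed`. Passing the raw string RDD to `sqlContext.read.json(rdd)` is another option that lets Spark infer the schema.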

I am very new to Spark Streaming and started working on a pet project by reading the documentation. Any help and guidance is greatly appreciated.

0 Answers:

No answers