I am using Python 2 and Spark. To count words on Twitter, I followed the instructions at https://github.com/Ruthvicp/CS5590_BigDataProgramming/wiki/Lab-Assignment-4----Spark-MLlib-classification-algorithms,-word-count-on-twitter-streaming

I have 2 files. The first is TSWordCount:
import findspark
findspark.init()
import pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import desc
from collections import namedtuple
import os
os.environ["SPARK_HOME"] = "C:\\spark-2.3.1-bin-hadoop2.7\\spark-2.3.1-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = "C:\\winutils\\"

def main():
    sc = SparkContext(appName="Countwords1234")
    wordcount = {}
    ssc = StreamingContext(sc, 5)
    lines = ssc.socketTextStream("localhost", 5678)
    fields = ("word", "count")
    Tweet = namedtuple('Text', fields)
    # lines = socket_stream.window(20)
    counts = lines.flatMap(lambda text: text.split(" "))\
        .map(lambda x: (x, 1))\
        .reduceByKey(lambda a, b: a + b).map(lambda rec: Tweet(rec[0], rec[1]))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
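To make the transformation chain above easier to follow, here is a minimal plain-Python sketch of what flatMap, map, and reduceByKey compute for a single batch of lines. It does not use Spark at all; the function name count_words and the sample input are illustrative assumptions, not part of the original script.

```python
from collections import namedtuple

# Same record shape as in TSWordCount: a (word, count) pair.
Tweet = namedtuple("Text", ("word", "count"))

def count_words(lines):
    """Plain-Python stand-in (hypothetical helper) for the pipeline on one batch."""
    # flatMap: split every line on single spaces and flatten into one list of words
    words = [word for text in lines for word in text.split(" ")]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts of all pairs that share the same word
    totals = {}
    for word, one in pairs:
        totals[word] = totals.get(word, 0) + one
    # final map: wrap each (word, count) pair in the namedtuple
    return [Tweet(word, count) for word, count in totals.items()]

if __name__ == "__main__":
    for record in count_words(["fifa world cup", "fifa final"]):
        print(record)
```

Note that split(" ") produces empty strings when a line contains consecutive spaces, so those would be counted as a "word" as well; calling text.split() with no argument avoids that.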
When I run this file it succeeds, and the output is "Listening on port 5678". My second file is TwitterListener:
import findspark
findspark.init()
import pyspark
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener
import socket
import json
import time

consumer_key = '30f****'
consumer_secret = 'smu7B******'
access_token = '153*******'
access_secret = 'QIizsB***'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

class TweetsListener(StreamListener):
    def __init__(self, csocket):
        self.client_socket = csocket

    def on_data(self, data):
        try:
            msg = json.loads(data)
            print(msg['text'].encode('utf-8'))
            self.client_socket.send(msg['text'].encode('utf-8'))
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True

def sendData(c_socket):
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    twitter_stream = Stream(auth, TweetsListener(c_socket))
    twitter_stream.filter(track=['fifa'])

if __name__ == "__main__":
    s = socket.socket()  # Create a socket object
    host = "localhost"  # Get local machine name
    port = 5678  # Reserve a port for your service.
    s.bind((host, port))  # Bind to the port
    print("Listening on port: %s" % str(port))
    s.listen(5)  # Now wait for client connection.
    c, addr = s.accept()  # Establish connection with client.
    print("Received request from: " + str(addr))
    time.sleep(5)
    sendData(c)
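One detail worth checking in the listener: according to the Spark documentation, socketTextStream interprets the received bytes as UTF-8 text and splits it into records at newline characters, while on_data sends the tweet text with no trailing newline. The sketch below, under that assumption, shows one way to frame each tweet as a single line before sending. frame_tweet and the socket-pair demo are hypothetical helpers for illustration, not part of the tutorial.

```python
import socket

def frame_tweet(text):
    """Return the tweet as one UTF-8 encoded, newline-terminated line."""
    # Collapse line breaks inside the tweet so one tweet stays one record.
    single_line = " ".join(text.splitlines())
    return (single_line + "\n").encode("utf-8")

def demo():
    """Send two framed tweets over a local socket pair and read them back."""
    sender, receiver = socket.socketpair()
    try:
        for tweet in ["fifa final\ntonight", "goal"]:
            # sendall keeps sending until the whole payload has been written
            sender.sendall(frame_tweet(tweet))
        # Signal end-of-stream so the reader loop below can terminate
        sender.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        # Split the received text back into one record per line
        return b"".join(chunks).decode("utf-8").splitlines()
    finally:
        sender.close()
        receiver.close()

if __name__ == "__main__":
    print(demo())
```

In the listener this would amount to calling self.client_socket.sendall(frame_tweet(msg['text'])) in place of the raw send, if the missing newline turns out to matter in your setup.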
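As you can see, the TwitterListener file listens on localhost:5678. In TSWordCount I use SparkContext(appName=""); I assumed I should put the name of my Twitter app there, so I used Countwords1234. Then I connect to the port with ssc.socketTextStream("localhost", 5678). But when I run TSWordCount I get an error saying:

Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by ...

I searched for the error and found a suggested solution, calling sc.stop(), so I put that after ssc.awaitTermination(). But it did not help. What should I do now?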
Answer 0 (score: 1)
I found the answer. I replaced [...] with sc = SparkContext(appName="Countwords1234"), and everything works. I still don't understand why, but in the end the result is what matters, lol.
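For context, the error quoted in the question says that a SparkContext named PySparkShell already exists, which typically happens when the script is run inside the pyspark shell or a notebook that has already created one. A common way around this, which may or may not be exactly what was changed here, is SparkContext.getOrCreate(), which returns the existing context instead of constructing a second one. The helper name get_spark_context below is hypothetical; this is a sketch under those assumptions.

```python
def get_spark_context(app_name="Countwords1234"):
    """Return the active SparkContext, or create one if none exists yet."""
    # Imported lazily so this module can be loaded even where pyspark is not installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name)
    # getOrCreate reuses a running context (for example the shell's `sc`);
    # the conf, including the app name, only takes effect if a new one is created.
    return SparkContext.getOrCreate(conf)
```

In TSWordCount, the line that constructs the context would then read sc = get_spark_context(). This also suggests why adding sc.stop() after ssc.awaitTermination() had no effect: the error is raised when the SparkContext constructor is called, so execution never reaches the later lines.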