Spark Streaming does not start

Time: 2019-04-24 21:21:20

Tags: apache-spark pyspark spark-streaming

I am working in a Jupyter notebook and want to simulate a server that sends dummy data to a Spark Streaming application running in another notebook.

So, my server code is:

# # -1) imports
import socket
import random
import time

# # 0) configuration
port = 12030
ip   = socket.gethostname()

# # 1) create a socket
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind((ip, port))
serversocket.listen(1)
serversocket.setblocking(False) 

# # 2) wait for Spark to connect
(clientsocket, address) = serversocket.accept()
print("Connection de %s :\n %s"%(address, clientsocket))

# # 3) send data
nb_client = 1000
nb_achat  = 5
clients   = ["client_%s"%x for x in range(nb_client)]
achats    = [(random.choice(clients), random.randint(0, 100)) for x in range(nb_achat)]

tps_attente = 1
nb_achat    = 50
for i in range(30):
    print(i)
    time.sleep(tps_attente)
    achats    = [(random.choice(clients), random.randint(0, 100)) for x in range(nb_achat)]
    for client, valeur in achats:
        to_send = '%s,%s\n'%(client, valeur)
        clientsocket.send(to_send.encode())
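
For reference, each record the server emits is a single text line of the form client_NNN,value. Before bringing Spark into the picture, a plain socket client like this minimal sketch (assuming the same host and port as above, and that nothing else has connected yet) can be used to check that the server really sends data:

import socket

# Debugging sketch only: connect to the server above and print a few records.
# Host and port are assumed to match the server configuration (gethostname, 12030).
host = socket.gethostname()
port = 12030

with socket.create_connection((host, port)) as s:
    buffer = b""
    for _ in range(10):                       # read a handful of chunks, then stop
        chunk = s.recv(4096)
        if not chunk:                         # server closed the connection
            break
        buffer += chunk
        *lines, buffer = buffer.split(b"\n")  # keep any partial record for the next read
        for line in lines:
            print(line.decode())              # e.g. "client_42,17"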

My Spark Streaming notebook is:

import socket
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

listen_to_ip   = socket.gethostname()
listen_to_port = 12030

spark       = SparkSession.builder.getOrCreate()
sc          = spark.sparkContext
nb_secondes = 4
ssc         = StreamingContext(sc, nb_secondes)
dstream     = ssc.socketTextStream(listen_to_ip, listen_to_port)

ssc.checkpoint("./checkpoint/")

def update_achats(nouvelles_valeurs, valeur_actuelle):
    if valeur_actuelle is None:
        valeur_actuelle = 0
    return sum(nouvelles_valeurs, valeur_actuelle)


data            = dstream.map(lambda x: x.split(","))
clients_facture = data.map(lambda x: (x[0], float(x[1])*float(x[2])))
update_client   = clients_facture.updateStateByKey(update_achats)
update_client.pprint()

ssc.start()
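
Note that ssc.start() returns immediately, so the notebook cell finishes while the micro-batches keep running in the background. A minimal way to keep the cell blocked while batches are processed, sketched here with an arbitrary timeout, is:

# Block the cell while micro-batches run (arbitrary 60 s timeout), then stop
# streaming but keep the underlying SparkContext alive for further use.
ssc.awaitTerminationOrTimeout(60)
ssc.stop(stopSparkContext=False, stopGraceFully=True)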

So I start the server first, then start the Spark Streaming notebook.

On the server side, I first see:

Connection de ('172.17.0.2', 40258) :
 <socket.socket fd=44, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('172.17.0.2', 12030), raddr=('172.17.0.2', 40258)>

Then on the server I see the loop's progress counter, indicating that data is being sent:

0
1
2
3
4
5
6
7
8
9
10
...

On the Spark Streaming notebook, nothing shows up :-(

I remember running into this problem a year ago already; it must be a configuration issue. Any clues?
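
One way to narrow this down, sketched below with the same host and port as above, is to print the raw lines in a fresh StreamingContext without any splitting or aggregation, to see whether records reach Spark at all:

import socket
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Debugging sketch: a fresh context that only prints whatever lines arrive.
spark = SparkSession.builder.getOrCreate()
ssc   = StreamingContext(spark.sparkContext, 4)
raw   = ssc.socketTextStream(socket.gethostname(), 12030)
raw.pprint()                          # no split, no updateStateByKey
ssc.start()
ssc.awaitTerminationOrTimeout(30)     # let a few 4-second batches run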

0 Answers:

No answers