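I want to load CSV data in Python and stream each row to Spark via Spark Streaming.
I'm new to networking. I'm not sure whether I'm supposed to create a Python server script that, once it establishes a connection (with Spark Streaming), starts sending each row. In the Spark Streaming documentation they run nc -l 9999 which, if I understand correctly, is a netcat server listening on port 9999. So I tried writing a similar Python script that parses a CSV file and sends it on port 60000: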
import socket  # Import socket module
import csv

port = 60000  # Reserve a port for your service.
s = socket.socket()  # Create a socket object
host = socket.gethostname()  # Get local machine name
s.bind((host, port))  # Bind to the port
s.listen(5)  # Now wait for client connection.
print('Server listening....')

while True:
    conn, addr = s.accept()  # Establish connection with client.
    print('Got connection from', addr)
    # In Python 3 the csv module expects a text-mode file opened with newline=''
    with open('Titantic.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            # End each record with '\n' so a line-based receiver can split them
            line = ','.join(row) + '\n'
            conn.sendall(line.encode('utf-8'))  # Sockets send bytes, not str
            print(line, end='')
    print('Done sending')
    conn.sendall(b'Thank you for connecting\n')
    conn.close()
Spark Streaming script:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()  # Reuse the notebook's context if one already exists
ssc = StreamingContext(sc, 1)  # Batch interval of 1 second

# Create a DStream that will connect to hostname:port, like localhost:9999
lines_RDD = ssc.socketTextStream("localhost", 60000)

# Split each line into its comma-separated fields
data_RDD = lines_RDD.flatMap(lambda line: line.split(","))
data_RDD.pprint()  # Output operation: print the first elements of each batch

ssc.start()  # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
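When I run the Spark script (in a Jupyter notebook, by the way) I get this error: IllegalArgumentException: 'requirement failed: No output operations registered, so nothing to execute'
I don't think I'm writing my socket script correctly, but I'm not sure what to change. I'm basically trying to replicate what nc -lk 9999 does, so that I can send text data over a port while Spark Streaming listens on it, receives the data, and processes it.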
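For context, socketTextStream treats the incoming bytes as UTF-8 text delimited by '\n', which is what nc produces when you type lines into it. Below is a minimal sketch of framing CSV rows the same way. The helper name frame_rows and the sample data are made up for illustration, and it assumes no field contains a comma or newline, since ','.join drops the CSV quoting.

```python
import csv
import io


def frame_rows(csv_text):
    """Yield each CSV row as newline-terminated UTF-8 bytes (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    for row in reader:
        # Re-join the parsed fields and append '\n' so the receiver
        # sees exactly one record per line.
        yield (','.join(row) + '\n').encode('utf-8')


# Made-up sample data, standing in for the contents of a CSV file.
sample = "name,age\nAlice,29\nBob,41\n"
frames = list(frame_rows(sample))
print(frames)
```

In a server loop, each frame could then be passed to conn.sendall(frame).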
Any help would be greatly appreciated.
Answer 0: (score: 2)
I was trying to do something similar, but I wanted to stream one row every 10 seconds. I solved it with this script:
import socket
from time import sleep

host = 'localhost'
port = 12345

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((host, port))
s.listen(1)
while True:
    print('\nListening for a client at', host, port)
    conn, addr = s.accept()
    print('\nConnected by', addr)
    try:
        print('\nReading file...\n')
        with open('iris_test.csv') as f:
            for line in f:
                # Each line read from the file still ends with '\n'
                out = line.encode('utf-8')
                print('Sending line', line)
                conn.sendall(out)
                sleep(10)  # Wait 10 seconds before sending the next row
            print('End Of Stream.')
    except socket.error:
        print('Error occurred.\n\nClient disconnected.\n')
    conn.close()
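If you want to check a server like this without starting Spark, a plain Python client that reads newline-delimited lines is enough. The sketch below is only an illustration: read_lines is a hypothetical helper, and the demo uses socket.socketpair() with made-up rows in place of a real connection so that it runs on its own.

```python
import socket


def read_lines(sock, max_lines):
    """Read up to max_lines newline-delimited UTF-8 lines from a connected socket."""
    lines = []
    # makefile() wraps the socket in a text stream that can be iterated line by line.
    with sock.makefile('r', encoding='utf-8', newline='\n') as stream:
        for line in stream:
            lines.append(line.rstrip('\n'))
            if len(lines) >= max_lines:
                break
    return lines


# Self-contained demo: a connected socket pair stands in for server and client.
server_end, client_end = socket.socketpair()
server_end.sendall(b'a,1\nb,2\n')  # Made-up rows, one per line
server_end.close()
received = read_lines(client_end, max_lines=2)
client_end.close()
print(received)
```

Against the running server, the socket would come from socket.create_connection(('localhost', 12345)) instead of the socket pair.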
Hope this helps.