I want to stream CSV files as they are dropped into a directory. Once the data is in an RDD, I want to create two tables from the two CSVs (EVENTS and ACCESSPANE) and join them on a key. I am new to Spark Streaming. My code is below.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStrWordCount")
    sqlContext = SQLContext(sc)
    df = sqlContext.read.load('F:\\Documents\\Project\\POC\\ACCESSPANE.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
    df.registerTempTable("ACCESSPANE")
    ssc = StreamingContext(sc, 20)
    lines = ssc.textFileStream("C:\\SparkScala\\spark-1.6.2-bin-hadoop2.6\\examples\\src\\main\\python\\streaming\\Stream_Data\\streamdbdata")
    datalines = lines.filter(lambda x: "MACHINE" not in x)
    parts = datalines.map(lambda l: l.split(","))
    db1 = parts.map(lambda p: Row(devid=p[1], machine=p[8], timeutc=p[29]))
    #df1 = (db1.withColumn('timeutc', db1.timeutc.cast('timestamp')))
    df1 = sqlContext.createDataFrame(db1)
    df1.registerTempTable("EVENTS")
    sqlContext.sql("select b.panelid from EVENTS a, ACCESSPANE b where a.MACHINE = b.panelid").show()
    ssc.start()
    ssc.awaitTermination()
When I run it, the code fails with:
TypeError: 'TransformedDStream' object is not iterable
Please help me figure out how to display the joined data every 3 minutes. The full log is below:
16/09/04 20:12:48 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:61230 in memory (size: 19.3 KB, free: 511.1 MB)
16/09/04 20:12:48 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:61230 in memory (size: 1877.0 B, free: 511.1 MB)
16/09/04 20:12:48 INFO ContextCleaner: Cleaned accumulator 2
16/09/04 20:12:48 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:61230 in memory (size: 19.3 KB, free: 511.1 MB)
16/09/04 20:12:48 INFO FileInputDStream: Duration for remembering RDDs set to 60000 ms for org.apache.spark.streaming.dstream.FileInputDStream@486a2d4c
Traceback (most recent call last):
File "F:/Documents/Internship/Project_Lenel/dbstream.py", line 19, in <module>
df1=sqlContext.createDataFrame(db1)
File "C:\SparkScala\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\context.py", line 425, in createDataFrame
File "C:\SparkScala\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\context.py", line 338, in _createFromLocal
TypeError: 'TransformedDStream' object is not iterable
16/09/04 20:12:48 INFO SparkContext: Invoking stop() from shutdown hook
16/09/04 20:12:48 INFO SparkUI: Stopped Spark web UI at http://10.0.0.178:4040
16/09/04 20:12:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/04 20:12:48 INFO MemoryStore: MemoryStore cleared
16/09/04 20:12:48 INFO BlockManager: BlockManager stopped
16/09/04 20:12:48 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/04 20:12:48 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/04 20:12:48 INFO SparkContext: Successfully stopped SparkContext
16/09/04 20:12:48 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/04 20:12:48 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/04 20:12:48 INFO ShutdownHookManager: Shutdown hook called
16/09/04 20:12:48 INFO ShutdownHookManager: Deleting directory C:\Users\tailo\AppData\Local\Temp\spark-cea2c89b-46ac-4063-850a-aede53836d0c\pyspark-aad59f30-5769-4b98-8aa4-7e556f3adc56
16/09/04 20:12:48 INFO ShutdownHookManager: Deleting directory C:\Users\tailo\AppData\Local\Temp\spark-cea2c89b-46ac-4063-850a-aede53836d0c\httpd-bc2d2da1-9a27-4462-8c21-84a0e887b3c1
16/09/04 20:12:48 INFO ShutdownHookManager: Deleting directory C:\Users\tailo\AppData\Local\Temp\spark-cea2c89b-46ac-4063-850a-aede53836d0c
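My guess, and I am not sure this is right, is that the DStream cannot be passed to createDataFrame directly, and that each micro-batch instead has to be converted to a DataFrame inside foreachRDD and joined there against the static ACCESSPANE table. A rough sketch of what I have in mind is below; the 180-second batch interval and the process() helper are just my own assumptions, not something I have working.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.sql import Row

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStrWordCount")
    sqlContext = SQLContext(sc)

    # static lookup table, loaded once
    df = sqlContext.read.load('F:\\Documents\\Project\\POC\\ACCESSPANE.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
    df.registerTempTable("ACCESSPANE")

    # 180-second batches, since I want the joined output every 3 minutes (my assumption)
    ssc = StreamingContext(sc, 180)
    lines = ssc.textFileStream("C:\\SparkScala\\spark-1.6.2-bin-hadoop2.6\\examples\\src\\main\\python\\streaming\\Stream_Data\\streamdbdata")
    datalines = lines.filter(lambda x: "MACHINE" not in x)
    rows = datalines.map(lambda l: l.split(",")).map(lambda p: Row(devid=p[1], machine=p[8], timeutc=p[29]))

    def process(time, rdd):
        # runs on the driver once per batch; here rdd is an ordinary RDD,
        # so createDataFrame should be able to consume it
        if not rdd.isEmpty():
            events = sqlContext.createDataFrame(rdd)
            events.registerTempTable("EVENTS")
            sqlContext.sql("select b.panelid from EVENTS a, ACCESSPANE b where a.machine = b.panelid").show()

    rows.foreachRDD(process)

    ssc.start()
    ssc.awaitTermination()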