Windowing and aggregating a pyspark DataFrame

Asked: 2017-08-22 05:43:56

Tags: python apache-spark pyspark

I am trying to process incoming events from a socket, then window and aggregate the event data. I have hit a snag with the windowing. Even though I specify a schema for the DataFrame, it does not seem to be converted into columns.

My code:

import sys
from pyspark.sql.types import StructType, StringType, TimestampType, FloatType, IntegerType, StructField
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__ == "__main__":
    # our data currently looks like this (tab separated).
    # -SYMBOL  DATE                 PRICE    TICKVOL  BID      ASK
    # NQU7     2017-05-28T15:00:00  5800.50  12       5800.50  5800.50
    # NQU7     2017-05-28T15:00:00  5800.50  1        5800.50  5800.50
    # NQU7     2017-05-28T15:00:00  5800.50  5        5800.50  5800.50
    # NQU7     2017-05-28T15:00:00  5800.50  1        5800.50  5800.50

    if len(sys.argv) != 3:
        # print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)

    spark = SparkSession \
        .builder \
        .appName("StructuredTickStream") \
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel('WARN')

    # Read all the csv files written atomically in a directory
    tickSchema = StructType([
        StructField("symbol", StringType(), True),
        StructField("dt", TimestampType(), True),
        StructField("price", FloatType(), True),
        StructField("tickvol", IntegerType(), True),
        StructField("bid", FloatType(), True),
        StructField("ask", FloatType(), True)
    ])

    events_df = spark \
        .readStream \
        .option("sep", "\t") \
        .option("host", sys.argv[1]) \
        .option("port", sys.argv[2]) \
        .format("socket") \
        .schema(tickSchema) \
        .load()

    events_df.printSchema()
    print("columns = ", events_df.columns)

    ohlc_df = events_df \
        .groupby(F.window("dt", "5 minutes", "1 minutes")) \
        .agg(
            F.first('price').alias('open'),
            F.max('price').alias('high'),
            F.min('price').alias('low'),
            F.last('price').alias('close')
        ) \
        .collect()

    query = ohlc_df \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()

    query.awaitTermination()

The output of print("columns = ", events_df.columns) is shown below, and the process then fails:

['value']

Any idea what I am doing wrong?

1 Answer:

Answer 0 (score: -1)

Your DataFrame has only a single column, value, yet here you are trying to access the column dt on events_df. That is the root cause of the problem.

The statement below clearly shows that it has the single column value:

print("columns = ", events_df.columns)

You need to check this part:

events_df = spark \
    .readStream \
    .option("sep", "\t") \
    .option("host", sys.argv[1]) \
    .option("port", sys.argv[2]) \
    .format("socket") \
    .schema(tickSchema) \
    .load()

and work out why it creates the DataFrame with only one column.
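One likely explanation, offered as a hedged sketch rather than a definitive fix: Spark's socket source delivers each incoming line as a single string column named value, so the schema passed to readStream never shows up as columns. A common workaround is to split value manually and cast the pieces to the intended types. The names raw_df and fields below are illustrative, the tab separator and column types simply mirror the tickSchema from the question, and collect() is dropped because a streaming DataFrame is materialised by a writeStream query rather than collected.

import sys
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("StructuredTickStream").getOrCreate()

# The socket source emits one string column named "value" per line,
# regardless of any schema supplied to readStream.
raw_df = spark \
    .readStream \
    .format("socket") \
    .option("host", sys.argv[1]) \
    .option("port", sys.argv[2]) \
    .load()

# Split each tab-separated line and cast the pieces to the intended types.
fields = F.split(raw_df["value"], "\t")
events_df = raw_df.select(
    fields.getItem(0).alias("symbol"),
    fields.getItem(1).cast("timestamp").alias("dt"),
    fields.getItem(2).cast("float").alias("price"),
    fields.getItem(3).cast("int").alias("tickvol"),
    fields.getItem(4).cast("float").alias("bid"),
    fields.getItem(5).cast("float").alias("ask"),
)

# Windowed OHLC aggregation, kept as a streaming DataFrame (no collect()).
ohlc_df = events_df \
    .groupBy(F.window("dt", "5 minutes", "1 minute")) \
    .agg(
        F.first("price").alias("open"),
        F.max("price").alias("high"),
        F.min("price").alias("low"),
        F.last("price").alias("close"),
    )

# The streaming query drives the computation and prints batches to the console.
query = ohlc_df \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()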