Question

According to the documentation, the standard way to read a data stream into Apache Spark is:

events = (
    spark.readStream
    .format("json")               # or parquet, kafka, orc...
    .option("key", "value")       # placeholder for format-specific options
    .schema(my_schema)            # required
    .load("path/to/data")
)

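But I need to clean up some of the data and rearrange some fields before the schema is applied. I was hoping there would be something like: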

events = (
    spark.readStream
    .format("json")               # or parquet, kafka, orc...
    .option("key", "value")       # placeholder for format-specific options
    .schema(my_schema)            # required
    .map(custom_function)         # wished-for step: apply a custom function to each JSON object
    .load("path/to/data")
)

Is there an efficient way to do this in Apache Spark with Structured Streaming?

Answer 1

tl;dr The short answer is that you cannot do this before the dataset is loaded.

The only way I can think of is to load the dataset as a set of strings and clean it up with a series of withColumn or select transformations, which is effectively your .map(custom_function).
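A minimal sketch of that idea, assuming the files hold one JSON object per line. The names custom_function, clean_line, and build_events, and the fields user and meta.ts, are hypothetical and only illustrate the shape of the approach. The cleanup is plain Python so it can be tested without a cluster; pyspark is imported only inside build_events.

```python
import json


def custom_function(record: dict) -> dict:
    """Hypothetical cleanup: normalize the user name and lift a nested timestamp."""
    return {
        "user": str(record.get("user", "")).strip().lower(),
        "ts": (record.get("meta") or {}).get("ts"),
    }


def clean_line(line: str) -> str:
    """Parse one raw JSON line, apply custom_function, and re-serialize it."""
    return json.dumps(custom_function(json.loads(line)), sort_keys=True)


def build_events(spark, path, my_schema):
    """Read raw text, clean each line, then apply my_schema with from_json."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    clean_udf = F.udf(clean_line, StringType())
    raw = spark.readStream.format("text").load(path)  # one string column: "value"
    return (
        raw.withColumn("value", clean_udf(F.col("value")))
        .select(F.from_json(F.col("value"), my_schema).alias("event"))
        .select("event.*")
    )
```

Here my_schema describes the cleaned records, not the raw input. A Python UDF adds serialization overhead, so built-in column functions are usually preferable when they can express the cleanup.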

Answer 2

I agree with Jacek's answer. More specifically, you have two options:

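1. Apply a "super-schema" to the input data, then transform the result into the schema you want. This is the best approach when (a) all of the data is valid JSON and (b) the "super-schema" is reasonably stable, e.g. there are no dynamic field names.
2. Read the input as text, parse it with json4s (or another library of your choice), and manipulate it as needed. This is the best approach if (a) any input line may not be valid JSON or (b) there is no stable "super-schema".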
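Both options can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: Python's standard json module stands in for json4s (a Scala library), and the id field, the metric_ prefix, and all function names are hypothetical.

```python
import json
from typing import Optional

PREFIX = "metric_"  # hypothetical prefix marking dynamically named fields


def normalize(line: str) -> Optional[str]:
    """Return a canonical JSON string for a valid event line, else None."""
    try:
        obj = json.loads(line)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return None
    if not isinstance(obj, dict):
        return None
    # Collapse dynamic keys such as "metric_cpu" into one fixed map field.
    metrics = {k[len(PREFIX):]: v for k, v in obj.items() if k.startswith(PREFIX)}
    return json.dumps({"id": obj.get("id"), "metrics": metrics}, sort_keys=True)


def read_with_super_schema(spark, path):
    """Option 1: read with a wide schema, then select the target shape."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    super_schema = StructType([
        StructField("id", StringType()),
        StructField("metric_cpu", DoubleType()),
        StructField("metric_mem", DoubleType()),
    ])
    wide = spark.readStream.format("json").schema(super_schema).load(path)
    return wide.select(
        "id",
        F.create_map(
            F.lit("cpu"), F.col("metric_cpu"),
            F.lit("mem"), F.col("metric_mem"),
        ).alias("metrics"),
    )


def read_as_text(spark, path):
    """Option 2: read raw lines, drop unparsable ones, then apply the schema."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    normalize_udf = F.udf(normalize, StringType())
    target_schema = "id STRING, metrics MAP<STRING, DOUBLE>"
    lines = spark.readStream.format("text").load(path)
    return (
        lines.select(normalize_udf(F.col("value")).alias("json"))
        .where(F.col("json").isNotNull())
        .select(F.from_json(F.col("json"), target_schema).alias("event"))
        .select("event.*")
    )
```

Option 1 only works if every metric name is known up front, which is why the second variant moves the dynamic keys into a map before the schema is applied.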

How do I apply a custom data format/mapping to each event before loading the whole dataset?
