How can I evaluate an accumulator without triggering a write?

Time: 2020-11-03 21:55:26

Tags: apache-spark pyspark counter

I want to perform a lightweight validation before writing out a dataframe. Before the write, the dataframe has to be serialized through "foo" anyway, and I increment an accumulator inside "foo":

acc = sc.accumulator(0)
output = df.map(foo)
if acc.value < THRESHOLD:
    raise ValueError(f"Failed validation: {acc.value} < {THRESHOLD}")
output.write(path)

The problem is that the accumulator is not evaluated until output.write(path) actually runs, so acc.value is still 0 at the point of the check and the validation fails spuriously. I want to avoid writing the data when validation really does fail. What is the correct design pattern here?
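(A minimal, self-contained sketch, not part of the original question, of the lazy-evaluation behaviour at the heart of the problem; the body of foo here is a hypothetical stand-in. Accumulator updates made inside a transformation only happen once an action forces that transformation to run, so reading acc.value beforehand just returns the initial value.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)

def foo(x):
    acc.add(1)          # per-record bookkeeping, standing in for the real foo
    return x

mapped = sc.parallelize(range(10)).map(foo)
print(acc.value)        # 0 -- map is lazy, nothing has executed yet
mapped.count()          # an action forces the transformation to run
print(acc.value)        # 10 -- the accumulator is only populated after the action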

1 answer:

Answer 0 (score: 1)

If your goal is to validate a count before publishing the data to some output path, simply write the data to an intermediate path first. Then evaluate the accumulator, and if the count is valid, rename the intermediate path to the actual output destination.

acc = sc.accumulator(0)
output = df.map(foo)
output.write(tmp_path)       # the write is the action, so the accumulator is populated here
if acc.value < THRESHOLD:
    # fs.delete(tmp_path)
    raise ValueError(f"Failed validation: {acc.value} < {THRESHOLD}")
else:
    fs.rename(tmp_path, path)    # promote the validated data to the final path
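(The fs handle above is left abstract in the answer. As an assumption on my part rather than anything stated in the post, on a Hadoop-compatible filesystem one common way to obtain it from PySpark is through the JVM gateway; note that _jsc and _jvm are internal accessors rather than a stable public API, and that Hadoop's delete/rename take Path objects rather than strings.)

# Hedged sketch: obtain a Hadoop FileSystem handle from the active SparkContext.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

if acc.value < THRESHOLD:
    fs.delete(Path(tmp_path), True)          # recursive delete of the temporary output
    raise ValueError(f"Failed validation: {acc.value} < {THRESHOLD}")
else:
    fs.rename(Path(tmp_path), Path(path))    # promote the validated data to its final path

On HDFS the final rename is a cheap metadata operation, which is what makes this write-then-promote pattern attractive; on object stores such as S3, rename is implemented as copy-plus-delete, so the promotion step is correspondingly more expensive.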