Question

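I have a log file and need to check every line in it. Whenever the word "error" appears in a line, I need to get the two lines that follow it. I have to do this in PySpark.

For example, given this input log file:

line 1

line 2

line ...error... 3

line 4

line 5

line 6

The output would be:

line 4

line 5

I loaded the log file into an RDD and used map() to iterate over each line, but I don't have a clear idea of how to go on from there.

Thanks.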

Answer 1

Something like this:

Category.twig
{% block content %}
{% for post in posts %}
<li>{{ post.content }}</li>
{% endfor %}

{% endblock %}

Answer 2

Here is an approach using window functions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# set up DF; zipWithIndex records each line's position in 'idx' so the
# windows below order by position rather than by the text of the line
lines = ["line1", "line2", "line3..ERROR", "line4", "line5"]
df = sc.parallelize(lines).zipWithIndex().toDF(['col', 'idx'])

# flag error lines, then take a running sum so that every error line
# starts a new group
win1 = Window.orderBy('idx')
df = df.withColumn('hit_error', F.expr("case when col like '%ERROR%' then 1 else 0 end"))
df = df.withColumn('cum_error', F.sum('hit_error').over(win1))

# now number the lines within each group (the error line itself is row 1)
win2 = Window.partitionBy('cum_error').orderBy('idx')
df = df.withColumn('rownum', F.row_number().over(win2))

# the lines we want are rows 2 and 3 of each group
df.filter("cum_error > 0 and rownum in (2, 3)").orderBy('idx').select("col").show(10)
```
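To sanity-check the Spark result on a small sample, the same "next two lines after an error" rule can be written in plain Python. This is only a sketch for local verification, not part of the answer above; the helper name `lines_after_keyword` and its `keyword` and `n` parameters are made up for illustration.

```python
def lines_after_keyword(lines, keyword="ERROR", n=2):
    """Return the n lines that follow each line containing keyword."""
    wanted = set()
    for idx, line in enumerate(lines):
        if keyword in line:
            # remember the positions of the next n lines
            wanted.update(range(idx + 1, idx + 1 + n))
    # keep the original order; positions past the end of the input never match
    return [line for idx, line in enumerate(lines) if idx in wanted]


log = ["line1", "line2", "line3..ERROR", "line4", "line5", "line6"]
print(lines_after_keyword(log))  # ['line4', 'line5']
```

Unlike the row-number filter, this version also keeps the follow-up lines when two errors occur within two lines of each other. In Spark, the same index arithmetic could presumably be expressed by pairing each line with its position via `rdd.zipWithIndex()` and joining on the shifted positions.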

How do I get the next lines in a file whenever the word "error" appears in a line, using PySpark?
