How do I extract the previous and next lines in Spark?

Asked: 2018-10-18 04:54:04

Tags: scala apache-spark apache-spark-sql bigdata spark-streaming

I am using Apache Spark to analyze log files for a customer-impact analysis. In my log file each record spans several lines: one line holds the timestamp, another holds the customer details, and another holds the error caused by an exception. I want the output, in a single file, to merge all of the extracted records into one line per incident.

1 Answer:

Answer 0 (score: 0)

You can do this in a few ways with the DataFrame API. Here is one approach:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Read the log file line by line, keeping each line's index so the original order is preserved
val rd = sc.textFile("/FileStore/tables/log.txt").zipWithIndex.map { case (r, i) => Row(r, i) }
val schema = StructType(StructField("logs", StringType, false) :: StructField("id", LongType, false) :: Nil)
val df = spark.sqlContext.createDataFrame(rd, schema)
df.show

+--------------------+---+
|                logs| id|
+--------------------+---+
|2018-10-15 05:24:...|  0|
|                    |  1|
|com.xyz.abc.pqr.e...|  2|
|    at com.xyz.ab...|  3|
|    at java.util....|  4|
|    rContainer.do...|  5|
|    at org.spring...|  6|
|    at org.spring...|  7|
|    at java.lang....|  8|
|Caused by: java.l...|  9|
|    at com.xyz.ab...| 10|
|    at com.xyz.ab...| 11|
|    at com.xyz.ab...| 12|
|    at java.util....| 13|
|                    | 14|
|2018-10-15 05:24:...| 15|
|                    | 16|
|com.xyz.abc.pqr.P...| 17|

// Timestamp lines: keep the MQListener entries and strip everything from "ERROR" onwards
val df1 = df.filter($"logs".contains("c.l.p.a.c.event.listener.MQListener"))
  .withColumn("logs", regexp_replace($"logs", "ERROR.*", "")).sort("id")
df1.show

+--------------------+---+
|                logs| id|
+--------------------+---+
|2018-10-15 05:24:...|  0|
|2018-10-15 05:24:...| 15|
+--------------------+---+

// Customer-detail lines: keep the exception rows and drop everything up to and including "mandatory fields."
val df2 = df.filter($"logs".contains("PrescriptionNotValidException:"))
  .withColumn("logs", regexp_replace($"logs", "(.*?)mandatory fields.", "")).sort("id")
df2.show

+--------------------+---+
|                logs| id|
+--------------------+---+
| StoreId: 123, Co...|  2|
| StoreId: 234, Co...| 17|
+--------------------+---+

// Root-cause lines
val df3 = df.filter($"logs".contains("Caused by: java.lang.")).sort("id")
df3.show
// Pair each timestamp with its customer details and root cause by position
df1.select("logs").collect.toSeq.zip(df2.select("logs").collect.toSeq).zip(df3.select("logs").collect.toSeq)

+--------------------+---+
|                logs| id|
+--------------------+---+
|Caused by: java.l...|  9|
|Caused by: java.l...| 28|
+--------------------+---+

df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [logs: string, id: bigint]
res71: Seq[((org.apache.spark.sql.Row, org.apache.spark.sql.Row), org.apache.spark.sql.Row)] = ArrayBuffer((([2018-10-15 05:24:00.102 ],[ StoreId: 123, Co Patient Id: 123456789, Rx Number: 12345678]),[Caused by: java.lang.IllegalArgumentException: Invalid Dispense Object because compound: null and pack: null were missing.]), (([2018-10-15 05:24:25.136 ],[ StoreId: 234, Co Patient Id: 999999, Rx Number: 45555]),[Caused by: java.lang.NullPointerException: null]))
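The collect/zip result above still sits on the driver as a Seq of Row tuples. Below is a minimal sketch, assuming df1, df2 and df3 always line up one-to-one as in the example, of how those tuples could be flattened into one line per incident and written out as a single text file; the output path and the " | " separator are made up for illustration.

// Sketch only: assumes the three DataFrames have the same row count and
// correspond positionally; the path and separator are illustrative.
val merged: Seq[String] =
  df1.select("logs").collect.toSeq
    .zip(df2.select("logs").collect.toSeq)
    .zip(df3.select("logs").collect.toSeq)
    .map { case ((ts, details), cause) =>
      s"${ts.getString(0)} | ${details.getString(0)} | ${cause.getString(0)}"
    }

// coalesce(1) keeps everything in a single part file
sc.parallelize(merged).coalesce(1).saveAsTextFile("/FileStore/tables/impact_report")

Since the answer mentions there are a few ways to do this, another option for pulling the previous or next line relative to a matching row (which is what the question title asks for) is window functions. A rough sketch with lag/lead, reusing the same df with its id column; note that a window without partitionBy moves all rows into one partition, which is only acceptable for small logs:

import org.apache.spark.sql.expressions.Window

// Order the whole file by line number; lag/lead then expose the neighbouring lines
val w = Window.orderBy("id")
val withNeighbours = df
  .withColumn("prev_line", lag($"logs", 1).over(w))
  .withColumn("next_line", lead($"logs", 1).over(w))

// Filter afterwards, e.g. keep only the timestamp rows together with the lines around them
withNeighbours.filter($"logs".contains("c.l.p.a.c.event.listener.MQListener")).show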