Question

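I need to do a lot of data analysis on WebLogic application exception errors. My approach is as follows:

Stream the WebLogic application error logs into Hadoop using Flume or another streaming tool.
Load the data into a Spark DataFrame.
Write Spark SQL queries to analyze the error data.

We also have a database error log table, which I will use as a second data source to correlate WebLogic exceptions with database exceptions. The WebLogic error data is in CSV format, delimited by two pipe characters ("||"). The problem with the input data is that the last column is spread across multiple lines, as shown below. Spark treats each continuation of the last column as a new line, and therefore as a new record, so the load fails. I would appreciate any ideas on how to handle this.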

|| 20160704 || 01:58:32,294 || 396c0a8e2470e7a21467611910768 || com.seic.dataservices.impl.InstrumentSearchDoImpl || [ACTIVE] ExecuteThread: '9' for queue: 0)....INSTRUMENT_ID(1004915) PRICE_DATE(01-JUL-16) does not exist in table TABEL_NAME. Price data not found.. ORA-06512: at "Qxx_xxx.ERROR_PKG", line 502 ORA-06512: at "Qxx_xxx.IM_PRICING", line 6221 ORA-06512: at line 1)

-UK

Update: edited the input data set.

|| 20160704 || 00:32:48,544 || c0a07f3289f452801467606768492 || com.seic.dataservices.impl.GetInstDetailsForMaintImpl || [ACTIVE] ExecuteThread: '12' for queue: 'weblogic.kernel.Default (self-tuning)' || ERROR || ExceptionFactoryMsg: com.seic.dataservices.lib.DataServiceSqlException - Error - SQL exception encountered while processing this request. - EX4 - - q02_Desktop_MS1#20160704003248544#4 - Additional information: (error code: 6550 - ) - Caused by (java.sql.SQLException: ORA-06550: line 1, column 25:
PLS-00302: component 'GET_ASSET_TEMPLATE' must be declared
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored)
com.seic.dataservices.lib.DataServiceSqlException - Error - SQL exception encountered while processing this request. - EX4 - - q02_Desktop_MS1#20160704003248544#4 - Additional information: (error code: 6550 - ) - Caused by (java.sql.SQLException: ORA-06550: line 1, column 25:
PLS-00302: component 'GET_ASSET_TEMPLATE' must be declared
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored)
|| 20160704 || 00:32:48,551 || c0a07f3289f452801467606768492 || com.seic.common.presentation.exception.SeiExceptionHandler || [ACTIVE] ExecuteThread: '12' for queue: 'weblogic.kernel.Default (self-tuning)' || ERROR || Non-SeiException
javax.servlet.ServletException: Error (Exception) in DesktopAction base class
at com.seic.common.presentation.action.DesktopAction.execute(DesktopAction.java:368)
at org.apache.struts.chain.commands.servlet.ExecuteAction.execute(ExecuteAction.java:58)
at org.apache.struts.chain.commands.AbstractExecuteAction.execute(AbstractExecuteAction.java:67)
at org.apache.struts.chain.commands.ActionCommandBase.execute(ActionCommandBase.java:51)
at org.apache.commons.chain.impl.ChainBase.execute(ChainBase.java:191)
at org.apache.commons.chain.generic.LookupCommand.execute(LookupCommand.java:305)
at org.apache.commons.chain.impl.ChainBase.execute(ChainBase.java:191)

Answer 1

OK, here is what you can do.

Assuming every record always spans exactly one pair of lines, try this:

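# Pair every line with its 0-based line number: (line, index)
rdd = sc.textFile('file.log').zipWithIndex()
# Even-indexed lines start a record; odd-indexed lines are re-keyed to the preceding even index
rddFirsts = rdd.filter(lambda x: not(x[1] % 2)).map(lambda x: (x[1], x[0]))
rddSeconds = rdd.filter(lambda x: x[1] % 2).map(lambda x: (x[1]-1, x[0]))
# Join on the shared key: (index, (first_line, second_line))
rdd = rddFirsts.join(rddSeconds)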

This works as required. I know it may take a long time to run, but it produces the correct result.

I tested it on Spark 1.5.2.

EDITED

For Scala:

val rdd = sc.textFile("file.log").zipWithIndex()
val rddFirsts = rdd.filter(x => (x._2 % 2) == 0).map(x => (x._2, x._1))
val rddSeconds = rdd.filter(x => (x._2 % 2) != 0).map(x => (x._2-1, x._1))
val NewRdd = rddFirsts.join(rddSeconds)
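
The pairing above assumes every record occupies exactly two lines, while the question describes a last column that can spill over several lines. The following is a minimal sketch of a more general approach in plain Python, not a tested Spark job. It assumes (this is not stated in the question) that every record starts with "||", an 8-digit date, and another "||", so any line that does not match is treated as a continuation of the previous record. The names RECORD_START, merge_continuation_lines and split_fields are hypothetical.

```python
import re

# Assumption: a new record always begins with "||", an 8-digit date, then "||".
# Optional whitespace around the delimiters is tolerated.
RECORD_START = re.compile(r"^\s*\|\|\s*\d{8}\s*\|\|")


def merge_continuation_lines(lines):
    """Fold continuation lines into the record they belong to."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if RECORD_START.match(text):
            records.append(text)  # start of a new record
        elif records:
            records[-1] += " " + text  # continuation of the last column
        # lines that appear before the first record start are ignored
    return records


def split_fields(record):
    """Split one merged record on "||" and trim each field."""
    # The leading "||" yields an empty first element, which is dropped.
    return [field.strip() for field in record.split("||")][1:]


if __name__ == "__main__":
    # Made-up sample in the same shape as the data in the question.
    sample = [
        "||20160704||00:32:48,544||abc||com.example.Foo||thread-1||ERROR||first part",
        "ORA-06550: line 1, column 25:",
        "PLS-00302: component must be declared",
        "||20160704||00:32:48,551||abc||com.example.Bar||thread-1||ERROR||single line",
    ]
    for record in merge_continuation_lines(sample):
        print(split_fields(record))
```

With Spark, one way to apply this might be per file, for example sc.wholeTextFiles(path).flatMap(lambda kv: merge_continuation_lines(kv[1].splitlines())), at the cost of each file being read whole by a single task. The resulting field lists could then be turned into a DataFrame for the Spark SQL queries.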

Spark DataFrame - WebLogic application error log analysis
