Parsing tabular structured data from a log file

Date: 2019-09-02 11:29:34

Tags: python parsing pyspark

I need to parse a log file and select certain column values from it. The file looks like this:

Created following board-groups: all, allp, mp, bp, tu, coremp, ommp, scb, sxb, scx, sccpmp, et, etmfg, etmfx, aal2ap, aal2ncc, aal2cpsrc, aal2rh, xp, rax, tx, ru, ru[0-6], gcpu.
>>> Type "bp" to view available board-groups and "bp <group>" to view group contents.

Collecting TN data...
.......................
Collecting RF data...
put /tempfiles/20190818-235430_204/lhCmd32220190818235439 /d/usr/lhCmd32220190818235439 ... OK
....
Getting MO data from node (84 MOs). Please wait...
0%                                             ~50%                                           ~100%
...........................................................................................................................................................................................................

Node: RBS6601L                  CXP102051/27_R34N49 18.Q1 (C18.Q1_LSV231_PA20)
=====================================================================================================================================
SMN ;APN  ;BOARD    ;SWALLOCATION  ;S  ;FAULT OPER MAINT STAT   ;c/p ;  d  ;PRODUCTNUMBER  ;REV   ;SERIAL     ;DATE    ; TEMP;  UPT ;MO
=====================================================================================================================================
  0 ;  1  ;DUS3101  ;main          ;1;  OFF   ON   OFF   OFF    ;11% ;59%  ;KDU137624/3    ;R3A/A ;CD39685797 ;20140303 ; 53C ;82.0 ;1,Slot=1
  0 ;  2  ;DUS3102  ;DU_Extension  ;1;  OFF   ON   OFF   OFF    ; 5% ;35%  ;KDU137624/31   ;R5D   ;CD3W311461 ;20180323 ; 64C ;82.0 ;1,Slot=2
-------------------------------------------------------------------------------------------------------------------------------------

Now I need to select the values that correspond to the BOARD, SERIAL and TEMP columns. How can I achieve this with Python and PySpark?
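
For reference, here is a minimal plain-Python sketch of how those three columns could be pulled out of one log file. It is only a sketch: it assumes that the board table rows are the only ';'-separated lines in the file and that the first such line is the header; the function name parse_board_rows and the file name are placeholders.

def parse_board_rows(lines):
    """Return (BOARD, SERIAL, TEMP) tuples from the ';'-separated table rows."""
    header = None
    rows = []
    for line in lines:
        if ";" not in line:
            continue                       # skip everything outside the table
        fields = [f.strip() for f in line.split(";")]
        if header is None:
            header = fields                # first ';' line is the column header
            continue
        row = dict(zip(header, fields))
        rows.append((row.get("BOARD"), row.get("SERIAL"), row.get("TEMP")))
    return rows

with open("somefile.log") as f:            # placeholder file name
    print(parse_board_rows(f))
    # e.g. [('DUS3101', 'CD39685797', '53C'), ('DUS3102', 'CD3W311461', '64C')]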

I converted the data to a Spark RDD and then extracted the node IDs from the file names for further processing:

import pyspark
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
sqlContext = SQLContext(sc)
data = sc.textFile("filepath")
schema = StructType([StructField("nodeid", StringType(), True)])

# 'node' holds the log-file paths; its definition is not shown in this snippet.
final = []
for line in node:
    # Split the path on "/" and keep the first six characters of the
    # component that contains ".log" -- that prefix is the node ID.
    nodeid = line.strip().split("/")
    for text in nodeid:
        if ".log" in text:
            final.append(text[:6])

final1 = sc.parallelize(final)
df = sqlContext.createDataFrame(final1, StringType()).dropDuplicates()
df.show()

+------+
| value|
+------+
|206111|
|300321|
|304081|
|304594|
|500151|
|601201|
|304071|
|300271|
|501644|
|502394|
|304341|
|300991|
|304481|
|302011|
|301061|
|201201|
|202141|
|300211|
|700291|
|301814|
+------+

This is what I get from my code. Now, for each file, I need to add the column values that I expect after parsing (BOARD, SERIAL and TEMP) to this dataframe.
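
One possible way to combine both steps is sketched below: read every log file as a whole with sc.wholeTextFiles, take the node ID from the file name (the same six-character rule as above) and the BOARD/SERIAL/TEMP values from the ';'-separated table rows, and build a single dataframe. This is only a sketch, not a confirmed solution: the function parse_file and the directory name "logdir" are placeholders, and it again assumes the table rows are the only ';'-separated lines in each file.

import os
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
sqlContext = SQLContext(sc)

def parse_file(path_and_content):
    """Yield one Row per board line found in a single log file."""
    path, content = path_and_content
    # Node ID: first six characters of the ".log" file name (same rule as above).
    node_id = os.path.basename(path)[:6]
    header = None
    for line in content.splitlines():
        if ";" not in line:                      # table rows are ';'-separated
            continue
        fields = [f.strip() for f in line.split(";")]
        if header is None:
            header = fields                      # first ';' line is the header
            continue
        row = dict(zip(header, fields))
        yield Row(nodeid=node_id,
                  board=row.get("BOARD"),
                  serial=row.get("SERIAL"),
                  temp=row.get("TEMP"))

# "logdir" is a placeholder for the directory that holds the .log files.
logs = sc.wholeTextFiles("logdir")
df = sqlContext.createDataFrame(logs.flatMap(parse_file))
df.show()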
