I need to parse a log file and pick out certain column values from it. The file looks like this:
Created following board-groups: all, allp, mp, bp, tu, coremp, ommp, scb, sxb, scx, sccpmp, et, etmfg, etmfx, aal2ap, aal2ncc, aal2cpsrc, aal2rh, xp, rax, tx, ru, ru[0-6], gcpu.
>>> Type "bp" to view available board-groups and "bp <group>" to view group contents.
Collecting TN data...
.......................
Collecting RF data...
put /tempfiles/20190818-235430_204/lhCmd32220190818235439 /d/usr/lhCmd32220190818235439 ... OK
....
Getting MO data from node (84 MOs). Please wait...
0% ~50% ~100%
...........................................................................................................................................................................................................
Node: RBS6601L CXP102051/27_R34N49 18.Q1 (C18.Q1_LSV231_PA20)
=====================================================================================================================================
SMN ;APN ;BOARD ;SWALLOCATION ;S ;FAULT OPER MAINT STAT ;c/p ; d ;PRODUCTNUMBER ;REV ;SERIAL ;DATE ; TEMP; UPT ;MO
=====================================================================================================================================
0 ; 1 ;DUS3101 ;main ;1; OFF ON OFF OFF ;11% ;59% ;KDU137624/3 ;R3A/A ;CD39685797 ;20140303 ; 53C ;82.0 ;1,Slot=1
0 ; 2 ;DUS3102 ;DU_Extension ;1; OFF ON OFF OFF ; 5% ;35% ;KDU137624/31 ;R5D ;CD3W311461 ;20180323 ; 64C ;82.0 ;1,Slot=2
-------------------------------------------------------------------------------------------------------------------------------------
Now I need to select the values that correspond to the BOARD, SERIAL and TEMP columns. How can I do this with Python and PySpark?
I converted the file into a Spark RDD and then pulled the file name out of it for further processing:
import pyspark
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
sqlContext = SQLContext(sc)

data = sc.textFile("filepath")

# Schema with a single string column (defined here but not used below)
schema = StructType([StructField("nodeid", StringType(), True)])

final = []
for line in data.collect():                 # bring the lines to the driver
    for text in line.strip().split("/"):    # split path-like tokens on "/"
        if ".log" in text:
            final.append(text[:6])          # first 6 characters are the node id

final1 = sc.parallelize(final)
df = sqlContext.createDataFrame(final1, StringType()).dropDuplicates()
df.show()
+------+
| value|
+------+
|206111|
|300321|
|304081|
|304594|
|500151|
|601201|
|304071|
|300271|
|501644|
|502394|
|304341|
|300991|
|304481|
|302011|
|301061|
|201201|
|202141|
|300211|
|700291|
|301814|
+------+
This is what I get from my code. Now, for each file, I need to add the column values I expect from the parsing to this DataFrame.
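
For the column extraction itself, one possible approach is to keep only the data rows of the board table (the lines with enough ';'-separated fields), split them on ';', and pick the BOARD, SERIAL and TEMP fields by position. This is only a minimal sketch under the assumption that the board table always uses ';' as the separator and keeps the column order shown above; the path "log_path" is a placeholder.

from pyspark.sql import Row

# Assumed positions in the ';'-separated board table shown above:
# 0=SMN, 1=APN, 2=BOARD, ..., 10=SERIAL, 11=DATE, 12=TEMP; adjust if the format differs.
BOARD_IDX, SERIAL_IDX, TEMP_IDX = 2, 10, 12

def parse_board_rows(lines_rdd):
    """Keep only board-table data rows and pick BOARD/SERIAL/TEMP."""
    def to_row(line):
        fields = [f.strip() for f in line.split(";")]
        return Row(board=fields[BOARD_IDX],
                   serial=fields[SERIAL_IDX],
                   temp=fields[TEMP_IDX])

    return (lines_rdd
            .filter(lambda l: l.count(";") >= 13)   # data rows have 14 separators
            .filter(lambda l: "BOARD" not in l)     # drop the header row
            .map(to_row))

lines = sc.textFile("log_path")   # placeholder: one parsed log file
board_df = sqlContext.createDataFrame(parse_board_rows(lines))
board_df.show()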
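
To get those values per file and end up with one DataFrame that also carries the node id, one option is sc.wholeTextFiles, which returns (path, content) pairs, so the node id can be taken from the file name and the BOARD/SERIAL/TEMP values from the content in the same pass. Again a hedged sketch, reusing Row and the column indices from the block above; "logs_dir/*.log" and the 6-character node-id convention are taken from the snippet earlier, not from a confirmed file layout.

def parse_file(path_content):
    """Yield one Row per board entry, tagged with the node id from the file name."""
    path, content = path_content
    node_id = path.split("/")[-1][:6]   # first 6 characters of the file name, as above
    for line in content.splitlines():
        if line.count(";") >= 13 and "BOARD" not in line:
            fields = [f.strip() for f in line.split(";")]
            yield Row(nodeid=node_id,
                      board=fields[BOARD_IDX],
                      serial=fields[SERIAL_IDX],
                      temp=fields[TEMP_IDX])

# read every log file under a directory as (path, content) pairs
files = sc.wholeTextFiles("logs_dir/*.log")
node_board_df = sqlContext.createDataFrame(files.flatMap(parse_file))
node_board_df.show()

That would give one row per board per node, which could then be joined with or filtered against the node-id DataFrame df shown above.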