Splitting a string in a DataFrame using Scala on Spark

Time: 2019-03-04 05:08:38

Tags: scala apache-spark apache-spark-sql

I have a log file with more than 100 columns. Of those I only need two, "_raw" and "_time", so I loaded the log file as a "csv" DataFrame.

Step 1:

scala> val log = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("soa_prod_diag_10_jan.csv")
log: org.apache.spark.sql.DataFrame = [ARRAffinity: string, CoordinatorNonSecureURL: string ... 126 more fields]

Step 2: I registered the DF as a temp table: log.createOrReplaceTempView("logs")

Step 3: I extracted the two required columns, "_raw" and "_time"

scala> val sqlDF = spark.sql("select _raw, _time from logs")
sqlDF: org.apache.spark.sql.DataFrame = [_raw: string, _time: string]

scala> sqlDF.show(1, false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|_raw                                                                                                                                                                                                                                                                                                                                                                                                |_time|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [tid: [ACTIVE].ExecuteThread: '28' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b,0] [partition-name: DOMAIN] [tenant-name: GLOBAL] Aggregation Server Not Available. Failed to get remote aggregator[[|null |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
only showing top 1 row

My requirement:

I need to split the string in the '_raw' column to produce [2019-01-10T23:59.59-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b], with the column names a, b, c, d, e, f respectively.

Also, remove all null values from "_raw" and "_time".

Your answers will be much appreciated :)

1 Answer:

Answer 0: (score: 2)

You can use the split function to split _raw on spaces. This returns an array, and you can then extract values from that array. You can also use the regexp_extract function to extract values from the log message. Both approaches are shown below. Hope this helps.

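A minimal sketch of both approaches, assuming the bracket-delimited format of the sample line above: it splits on the "] [" boundary between fields rather than on bare spaces (several fields contain spaces themselves), and the field indexes and regex patterns are inferred from that single sample line, so adjust them to your actual log format.

    import org.apache.spark.sql.functions._

    // Drop rows where either required column is null, per the second requirement.
    val cleanDF = sqlDF.na.drop(Seq("_raw", "_time"))

    // Approach 1: split _raw on the "] [" boundary between bracketed fields
    // and pick fields by position. The indexes are assumptions taken from
    // the sample line shown above.
    val splitDF = cleanDF
      .withColumn("parts", split(col("_raw"), "\\] \\["))
      .select(
        regexp_replace(col("parts").getItem(0), "^\\[", "").as("a"), // timestamp
        col("parts").getItem(1).as("b"),                             // server name
        col("parts").getItem(2).as("c"),                             // log level
        col("parts").getItem(3).as("d"),                             // message id
        col("parts").getItem(4).as("e"),                             // component
        col("parts").getItem(7).as("f")                              // ecid
      )

    // Approach 2: regexp_extract with explicit patterns; more robust when the
    // number or order of bracketed fields varies between log lines.
    val regexDF = cleanDF.select(
      regexp_extract(col("_raw"), "^\\[([^\\]]+)\\]", 1).as("a"),
      regexp_extract(col("_raw"), "\\[(ecid: [^\\]]+)\\]", 1).as("f")
    )

    splitDF.show(1, false)
    regexDF.show(1, false)

Of the two, regexp_extract is the safer choice here, because the tid field contains nested brackets and spaces that positional splitting can mishandle.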