Trying to read from log files

Date: 2016-04-05 07:20:54

Tags: spark-streaming

I am trying to run a small program that reads a stream of logs from a directory on Windows. The logs come from the file logs_2016-04-05 in the logs folder; a sample log is given below. I want the application simply to print whatever result it finds in the logs. Is it even running?

    SparkConf conf = new SparkConf().setAppName("Log Analyzer").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaStreamingContext jssc = new JavaStreamingContext(sc, SLIDE_INTERVAL);

    JavaDStream<String> logData = jssc.textFileStream("C:/logs/");

    JavaDStream<ApacheAccessLog> accessLogsDStream = logData.flatMap(
            line -> {
                List<ApacheAccessLog> list = new ArrayList<>();
                try {
                    list.add(ApacheAccessLog.parseFromLogLine(line));
                    return list;
                } catch (RuntimeException e) {
                    // Skip lines that fail to parse
                    return list;
                }
            }).cache();

    JavaDStream<ApacheAccessLog> windowDStream
            = accessLogsDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);

    windowDStream.foreachRDD(accessLogs -> {
        System.out.println(accessLogs.count());
        if (accessLogs.count() == 0) {
            System.out.println("No access logs in this time intervalsddssdsd");
            return null;
        }
        return null;
    });

    jssc.start();
    jssc.awaitTermination();



    Log sample:

    64.242.88.10 - - [05/Apr/2016:10:27:07-0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
    64.242.88.10 - - [05/Apr/2016:10:27:07-0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
    64.242.88.10 - - [05/Apr/2016:10:27:07-0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
    64.242.88.10 - - [05/Apr/2016:10:27:07-0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
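For reference, these sample lines are in the Apache common log format. The `ApacheAccessLog.parseFromLogLine` implementation is not shown in the question; a minimal standalone parser sketch for this format (the class and field layout here are illustrative assumptions, not the asker's actual code) could look like:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogParseSketch {
    // Common Log Format: host ident authuser [timestamp] "request" status bytes
    private static final Pattern LOG_PATTERN = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    // Returns {ip, timestamp, request, status, bytes}, or null if the line does not match.
    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[]{m.group(1), m.group(4), m.group(5), m.group(6), m.group(7)};
    }

    public static void main(String[] args) {
        String sample = "64.242.88.10 - - [05/Apr/2016:10:27:07-0800] "
                + "\"GET /mailman/listinfo/hsdivision HTTP/1.1\" 200 6291";
        String[] fields = parse(sample);
        System.out.println(fields[0] + " " + fields[3] + " " + fields[4]);
        // prints: 64.242.88.10 200 6291
    }
}
```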

Result:

  16/04/05 12:42:25 INFO scheduler.JobScheduler: Started JobScheduler
  16/04/05 12:42:30 INFO dstream.FlatMappedDStream: Slicing from 1459840330000 ms to 1459840350000 ms (aligned to 1459840330000 ms and 1459840350000 ms)
  16/04/05 12:42:30 INFO dstream.FlatMappedDStream: Time 1459840340000 ms is invalid as zeroTime is 1459840340000 ms and slideDuration is 10000 ms and difference is 0 ms
  16/04/05 12:42:31 INFO dstream.FileInputDStream: Finding new files took 549 ms
  16/04/05 12:42:31 INFO dstream.FileInputDStream: New files at time 1459840350000 ms:

  16/04/05 12:42:31 INFO dstream.FlatMappedDStream: Persisting RDD 2 for time 1459840350000 ms to StorageLevel(false, true, false, false, 1) at time 1459840350000 ms
  16/04/05 12:42:31 INFO scheduler.JobScheduler: Added jobs for time 1459840350000 ms
  16/04/05 12:42:31 INFO scheduler.JobScheduler: Starting job streaming job 1459840350000 ms.0 from job set of time 1459840350000 ms
  16/04/05 12:42:31 INFO spark.SparkContext: Starting job: count at Streaming2.java:76
  16/04/05 12:42:31 INFO spark.SparkContext: Job finished: count at Streaming2.java:76, took 0.121043739 s
  0
  16/04/05 12:42:31 INFO spark.SparkContext: Starting job: count at Streaming2.java:77
  16/04/05 12:42:31 INFO spark.SparkContext: Job finished: count at Streaming2.java:77, took 2.408E-5 s
  No access logs in this time intervalsddssdsd
  16/04/05 12:42:31 INFO scheduler.JobScheduler: Finished job streaming job 1459840350000 ms.0 from job set of time 1459840350000 ms
  16/04/05 12:42:31 INFO scheduler.JobScheduler: Total delay: 1.973 s for time 1459840350000 ms (execution: 0.555 s)
  16/04/05 12:42:32 INFO dstream.FileInputDStream: Cleared 0 old files that were older than 1459840310000 ms: 
  16/04/05 12:42:40 INFO dstream.FlatMappedDStream: Slicing from 1459840340000 ms to 1459840360000 ms (aligned to 1459840340000 ms and 1459840360000 ms)
  16/04/05 12:42:40 INFO dstream.FlatMappedDStream: Time 1459840340000 ms is invalid as zeroTime is 1459840340000 ms and slideDuration is 10000 ms and difference is 0 ms
  16/04/05 12:42:40 INFO dstream.FileInputDStream: Finding new files took 2 ms
  16/04/05 12:42:40 INFO dstream.FileInputDStream: New files at time 1459840360000 ms:

  16/04/05 12:42:40 INFO dstream.FlatMappedDStream: Persisting RDD 6 for time 1459840360000 ms to StorageLevel(false, true, false, false, 1) at time 1459840360000 ms
  16/04/05 12:42:40 INFO scheduler.JobScheduler: Added jobs for time 1459840360000 ms
  16/04/05 12:42:40 INFO scheduler.JobScheduler: Starting job streaming job 1459840360000 ms.0 from job set of time 1459840360000 ms
  16/04/05 12:42:40 INFO spark.SparkContext: Starting job: count at Streaming2.java:76
  16/04/05 12:42:40 INFO spark.SparkContext: Job finished: count at Streaming2.java:76, took 3.158E-5 s
  0
  16/04/05 12:42:40 INFO spark.SparkContext: Starting job: count at Streaming2.java:77
  16/04/05 12:42:40 INFO spark.SparkContext: Job finished: count at Streaming2.java:77, took 4.5003E-5 s
  No access logs in this time intervalsddssdsd
  16/04/05 12:42:40 INFO scheduler.JobScheduler: Finished job streaming job 1459840360000 ms.0 from job set of time 1459840360000 ms
  16/04/05 12:42:40 INFO scheduler.JobScheduler: Total delay: 0.034 s for time 1459840360000 ms (execution: 0.013 s)
  16/04/05 12:42:40 INFO rdd.UnionRDD: Removing RDD 3 from persistence list
  16/04/05 12:42:40 INFO storage.BlockManager: Removing RDD 3
  16/04/05 12:42:40 INFO dstream.FileInputDStream: Cleared 0 old files that were older than 1459840320000 ms: 
  16/04/05 12:42:50 INFO dstream.FlatMappedDStream: Slicing from 1459840350000 ms to 1459840370000 ms (aligned to 1459840350000 ms and 1459840370000 ms)
  16/04/05 12:42:50 INFO dstream.FileInputDStream: Finding new files took 0 ms
  16/04/05 12:42:50 INFO dstream.FileInputDStream: New files at time 1459840370000 ms:

  16/04/05 12:42:50 INFO dstream.FlatMappedDStream: Persisting RDD 10 for time 1459840370000 ms to StorageLevel(false, true, false, false, 1) at time 1459840370000 ms
  16/04/05 12:42:50 INFO scheduler.JobScheduler: Added jobs for time 1459840370000 ms

1 answer:

Answer 0 (score: 0)

You just need some output operation, i.e. an operation that "triggers the actual execution of all the DStream transformations (similar to actions for RDDs)": http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

For example, `windowDStream.count().print()` or `windowDStream.print()`.
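Applied to the code in the question, a sketch might look like the following (this assumes the `jssc` and `windowDStream` variables from the question and a Spark Streaming dependency on the classpath; it is a fragment, not a complete program):

```java
// print() is an output operation, so it forces the windowed
// DStream to actually be computed on every batch interval.
windowDStream.count().print();  // prints the element count of each windowed RDD
windowDStream.print();          // prints the first few elements of each windowed RDD

// Output operations must be registered before start().
jssc.start();
jssc.awaitTermination();
```

Note that `foreachRDD` is itself an output operation, so the asker's code should also trigger execution once the lambda is closed correctly; the counts of 0 in the log output above indicate that no *new* files were detected in `C:/logs/` after the stream started, since `textFileStream` only picks up files created in the directory while the application is running.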