Custom Apache Spark Streaming receiver in Java (text files)

Date: 2018-01-19 06:53:47

Tags: java apache-spark spark-streaming

I am new to Apache Spark.

I need to read log files from a local/mounted directory. An external source writes files into this directory; for example, it writes a log to combined_file.txt, and once the write is finished it creates a marker file with the same name prefixed with "0_", e.g. 0_combined_file.txt. I then need to read the completed log file and process it. So I am trying to write a custom receiver that checks whether a log file written to the local/mounted directory is complete, and then reads it:

@Override
public void onStart() {
    Runnable th = () -> {
        while (!isStopped()) {
            try {
                Thread.sleep(1000L);
                File dir = new File("/home/PK01/Desktop/arcflash/");
                // marker files --> 0_test.txt, data files --> test.txt
                File[] markerFiles = dir.listFiles((dirName, fileName) ->
                        fileName.toLowerCase().startsWith("0_"));
                for (File metaDataFile : markerFiles) {
                    String compFileName = metaDataFile.getName().substring(2);
                    File dataFile = new File("/home/PK01/Desktop/arcflash/" + compFileName);
                    if (dataFile.exists()) {
                        byte[] data = new byte[(int) dataFile.length()];
                        // the original snippet read from an undeclared fis; open the stream here
                        try (FileInputStream fis = new FileInputStream(dataFile)) {
                            fis.read(data);
                        }
                        store(new String(data));
                        dataFile.delete();
                        metaDataFile.delete();
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };
    new Thread(th).start(); // the original created the thread but never started it
}

Here is my driver code:

JavaReceiverInputDStream<String> data = jssc.receiverStream(receiver);
data.foreachRDD(fileStreamRdd -> {
    processOnSingleFile(fileStreamRdd.flatMap(streamBatchData -> {
        return Arrays.asList(streamBatchData.split("\\n")).iterator();
    }));
});
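One likely contributor to the OutOfMemoryError reported below is that the receiver reads each completed file into a single byte[] and calls store() with one large String. A more memory-friendly approach is to read the file line by line and store small records individually. The sketch below is illustrative, not the asker's actual code: the helper name readInChunks is invented, and inside a real Receiver the consumer would be this::store.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedFileRead {

    // Reads a file line by line instead of loading it whole into one byte[];
    // in a Spark Receiver, pass this::store as the consumer.
    static void readInChunks(File dataFile, Consumer<String> store) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(dataFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                store.accept(line); // one small record at a time
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Self-contained demo: write a small temp file, read it back in chunks.
        Path tmp = Files.createTempFile("combined_file", ".txt");
        Files.write(tmp, Arrays.asList("line1", "line2", "line3"));

        List<String> stored = new ArrayList<>();
        readInChunks(tmp.toFile(), stored::add);

        System.out.println(stored.size()); // prints 3
        Files.delete(tmp);
    }
}
```

This keeps each call to store() small, so the receiver's memory use no longer grows with the size of the incoming file.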

When I try to process the data, the job fails with the following exception:

18/01/19 12:08:39 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
18/01/19 12:08:39 WARN BlockManager: Block input-0-1516343919400 replicated to only 0 peer(s) instead of 1 peers
18/01/19 12:08:40 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Output.<init>(Output.java:60)
    at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:91)
    at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:308)
    at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:308)
    at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/01/19 12:08:40 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1,5,main]
java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Output.<init>(Output.java:60)
    at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:91)
    at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:308)
    at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:308)
    at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/01/19 12:08:40 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Output.<init>(Output.java:60)
    at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:91)
    at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:308)
    at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:308)
    at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)



Can anyone help me resolve this error?

Any help would be appreciated.

1 Answer:

Answer 0 (score: 0):

18/01/19 12:08:40 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1,5,main] java.lang.OutOfMemoryError: Java heap space

The line above shows that you are running out of memory. Explicitly increase the driver and executor memory when submitting the Spark job.
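As a sketch of the suggestion above: memory can be raised with spark-submit's standard flags. The jar name, main class, and 4g sizes below are placeholders, not values from the question.

```shell
# Placeholder class/jar names; the --driver-memory and --executor-memory
# flags are the relevant part for fixing the Java heap space error.
spark-submit \
  --class com.example.LogStreamApp \
  --master local[2] \
  --driver-memory 4g \
  --executor-memory 4g \
  myapp.jar
```

Since the stack trace shows the failure in the local executor driver, raising --driver-memory is the flag that matters most in local mode.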