How to split an S3 object by line count

Time: 2017-06-14 06:19:51

Tags: java amazon-s3

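I have a file in Amazon S3 that contains about 2 million records. I want to process these records with multiple threads so that the processing finishes quickly. I know this could be done with Spark or MapReduce, but a constraint rules out both Spark and MR.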

Currently I am doing the following:

for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    S3Object s3object = s3Client
            .getObject(new GetObjectRequest(s3Conn.getBucket(), objectSummary.getKey()));

    // try-with-resources closes the reader and the underlying S3 stream when done
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(s3object.getObjectContent()))) {
        List<Events> ingEvents = new LinkedList<>();
        String fileLine;
        while ((fileLine = reader.readLine()) != null) {
            // Processing the line
        }
    }
}

Any suggestions on how to do this in Java would be a great help. Thanks in advance. Cheers!

3 answers:

Answer 0: (score: 0)

I would use the "split" command in Linux.

For example, to split a large file into smaller files of 10000 lines each (INPUT_FILE is a placeholder for the downloaded file):

split -l 10000 INPUT_FILE

A Java program can then process each of the resulting files separately.
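A minimal Java sketch of that second step, assuming the files written by split sit in a local directory passed as the first argument. The class name ChunkFileProcessor, the empty processLine hook, and the pool size of 4 are illustrative assumptions, not something the answer specifies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ChunkFileProcessor {

    // Hypothetical per-line work; replace with the real record processing.
    static void processLine(String line) {
    }

    // Reads one chunk file line by line and returns how many lines it handled.
    static long processFile(Path file) {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Submits one task per chunk file to a fixed-size pool and sums the line counts.
    static long processAll(List<Path> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> processFile(file);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that file is finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for chunk files", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk file failed to process", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: ChunkFileProcessor <directory-with-split-files>");
            return;
        }
        List<Path> files;
        try (Stream<Path> listing = Files.list(Paths.get(args[0]))) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        System.out.println("Processed " + processAll(files, 4) + " lines");
    }
}
```

Each file is handled by a single thread, so the number of lines per file given to split controls how evenly the work spreads across the pool.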

Answer 1: (score: 0)

You can use java.util.Scanner to read the file line by line or by regular expression. A short demo of how to do it:

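String xmlFile = "input.txt"; // placeholder path; point this at the real file
try (Scanner sc = new Scanner(new File(xmlFile))) {
    // hasNextLine() becomes false once every line has been read
    while (sc.hasNextLine()) {
        String nextLine = sc.nextLine();
        // process nextLine here
    }
}

First create the Scanner object, passing it the File built from xmlFile as the argument. Then read the file line by line and process each line inside the while loop. Once all lines have been read, sc.hasNextLine() returns false and the loop ends; calling sc.nextLine() at that point would throw NoSuchElementException rather than return null.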
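Because the Scanner hands back one line at a time, the same loop can also cut the input into groups of a fixed number of lines without an external tool. The following is only a sketch under that assumption; the class name LineBatcher, the method nextBatch, and the sample records are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LineBatcher {

    // Reads up to batchSize lines; returns an empty list once the input is exhausted.
    static List<String> nextBatch(Scanner sc, int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && sc.hasNextLine()) {
            batch.add(sc.nextLine());
        }
        return batch;
    }

    public static void main(String[] args) {
        // An in-memory string stands in for the real file here.
        try (Scanner sc = new Scanner("r1\nr2\nr3\nr4\nr5")) {
            List<String> batch;
            while (!(batch = nextBatch(sc, 2)).isEmpty()) {
                System.out.println(batch); // prints [r1, r2], then [r3, r4], then [r5]
            }
        }
    }
}
```

Each returned batch is an independent list, so it can be passed to a worker thread while the reader moves on to the next group.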

Answer 2: (score: 0)

A simple way to process the file with multiple threads is to use a Java 8 parallel stream with a lambda, for example:

import java.util.Arrays;
import java.util.List;

public class ThreadTest {
    static final String[] myData = {
            "Line 1","Line 2","Line 3","Line 4","Line 5","Line 6","Line 7","Line 8","Line 9","Line 10","Line 11","Line 12"
    };
    static final List<String> myList = Arrays.asList(myData);

    public static void main(String[] args) {
        // A parallel stream runs on the shared ForkJoinPool.commonPool() plus the
        // calling thread, so no separate ExecutorService is needed here.
        myList.stream().parallel().forEach(item -> {
            System.out.println("Processing " + item + " in thread " + Thread.currentThread().getName());
        });
    }
}

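If you run it, you will see the lines being processed concurrently across several threads (the order and the thread names vary from run to run):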

Processing Line 8 in thread main
Processing Line 4 in thread ForkJoinPool.commonPool-worker-1
Processing Line 9 in thread main
Processing Line 11 in thread ForkJoinPool.commonPool-worker-2
Processing Line 2 in thread ForkJoinPool.commonPool-worker-3
Processing Line 12 in thread ForkJoinPool.commonPool-worker-2
Processing Line 7 in thread main
Processing Line 6 in thread ForkJoinPool.commonPool-worker-1
Processing Line 1 in thread main
Processing Line 10 in thread ForkJoinPool.commonPool-worker-2
Processing Line 3 in thread ForkJoinPool.commonPool-worker-3
Processing Line 5 in thread ForkJoinPool.commonPool-worker-1
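The parallel stream above always runs on the common ForkJoinPool, as the thread names in the output show. Where the number of threads has to be chosen explicitly, a fixed-size ExecutorService fed with batches of lines is one alternative. The sketch below assumes a BufferedReader like the one in the question (an in-memory StringReader stands in for the S3 stream); the class name BatchedLineProcessor, the batch size, and the thread count are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class BatchedLineProcessor {

    // Reads the input in batches of batchSize lines, runs each batch on the pool,
    // and returns how many lines the handler was called with.
    static long process(BufferedReader reader, int batchSize, int threads, Consumer<String> handler) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong processed = new AtomicLong();
        try {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    submit(pool, batch, handler, processed);
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
            if (!batch.isEmpty()) {
                submit(pool, batch, handler, processed); // trailing partial batch
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            pool.shutdown(); // no new tasks; queued batches still run
        }
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("Timed out waiting for batches to finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for batches", e);
        }
        return processed.get();
    }

    private static void submit(ExecutorService pool, List<String> batch,
                               Consumer<String> handler, AtomicLong processed) {
        pool.execute(() -> {
            for (String item : batch) {
                handler.accept(item);
                processed.incrementAndGet();
            }
        });
    }

    public static void main(String[] args) {
        StringBuilder data = new StringBuilder();
        for (int i = 1; i <= 12; i++) {
            data.append("Line ").append(i).append('\n');
        }
        BufferedReader reader = new BufferedReader(new StringReader(data.toString()));
        long count = process(reader, 4, 3, item ->
                System.out.println("Processing " + item + " in thread " + Thread.currentThread().getName()));
        System.out.println("Processed " + count + " lines");
    }
}
```

Reading stays on a single thread while the per-line work runs on the pool, which fits the case where processing a record costs more than reading it. Queued batches are held in memory until a worker picks them up, so for very large inputs a bounded queue may be worth considering.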