Comparing large CSV files using HashMap or HashSet

Time: 2018-01-03 11:03:15

Tags: java csv hashmap compare hashset

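I am trying to compare two huge CSV files. The first file (id.csv) contains user IDs, and the second file (data.csv) contains raw data. I want to iterate over every id in the first file, find all raw-data rows with the same id in the second file, and write them to a new file. I tried the simple code below, but I think it would take more than a month to finish. Please help me implement code that runs faster.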

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Scanner;

    public class FilterUser {

        public static String UniqueUser = "D:/test/id.csv";
        public static String Raw = "D:/test/data.csv";
        public static String OutputFile = "D:/test/output.csv";

        public static void main(String[] args) throws IOException
        {
            Scanner ScanIn1 = null;
            String users = "";
            String[] record;
            ArrayList<String> InArray = new ArrayList<>();
            String line;
            long startTime = System.currentTimeMillis();

            try{
                ScanIn1 = new Scanner(new BufferedReader(new FileReader(UniqueUser)));
                BufferedReader br = new BufferedReader(new FileReader(Raw));
                BufferedWriter bw = new BufferedWriter(new FileWriter(OutputFile));
                bw.write("id,date,time,Use_duration,book1,book2");
                bw.newLine();

                // Load all ids from id.csv into memory
                while(ScanIn1.hasNext()){
                    users = ScanIn1.nextLine();
                    InArray.add(users);
                }
                while((line = br.readLine()) != null){
                    record = line.split(",");
                    // Every data row is compared against every id: (ids x rows) comparisons
                    for(int i=0; i<InArray.size(); i++){
                        if(InArray.get(i).equals(record[0])){
                            String output = record[0] + "," + record[1] + "," + record[2] + "," + record[3] + "," + record[4]+ "," + record[5];
                            bw.write(output);
                            bw.newLine();
                        }
                    }
                }

                br.close();
                bw.close();
                ScanIn1.close();
            }
            catch (FileNotFoundException ex){
                System.out.println(ex);
            }
            catch (IOException ex){
                System.out.println(ex);
            }
            long endTime = System.currentTimeMillis();
            long TotalTime = endTime - startTime;
            System.out.println("Total time =" + TotalTime);
        }

    }

id.csv

data.csv

1 answer:

Answer 0 (score: 0)

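Your code can be rewritten to use a HashSet in place of the ArrayList. HashSet's contains() method runs in O(1) time on average, so the lookup becomes far cheaper: instead of nesting two loops (the while and the for) to check each row against every id, you make a single pass over the data file with one lookup per row.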
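To illustrate the difference on a small scale, here is a minimal sketch that runs both strategies on in-memory lists. The class and method names (`LookupDemo`, `matchWithList`, `matchWithSet`) are made up for this example, and it assumes the ids in the first list are unique, as the name id.csv of "unique users" suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of the original program.
final class LookupDemo {

    // Nested-loop version: every row id is compared against every id, O(N * M).
    static List<String> matchWithList(List<String> ids, List<String> rowIds) {
        List<String> matches = new ArrayList<>();
        for (String rowId : rowIds) {
            for (String id : ids) {
                if (id.equals(rowId)) {
                    matches.add(rowId);
                }
            }
        }
        return matches;
    }

    // Set version: one hash lookup per row id, O(N + M) on average.
    static List<String> matchWithSet(List<String> ids, List<String> rowIds) {
        Set<String> idSet = new HashSet<>(ids);
        List<String> matches = new ArrayList<>();
        for (String rowId : rowIds) {
            if (idSet.contains(rowId)) {
                matches.add(rowId);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("101", "102", "103");
        List<String> rowIds = Arrays.asList("102", "999", "101", "102");
        // Both calls select the same rows: [102, 101, 102]
        System.out.println(matchWithList(ids, rowIds));
        System.out.println(matchWithSet(ids, rowIds));
    }
}
```

With N ids and M data rows, the first method performs about N × M string comparisons while the second performs roughly N insertions plus M lookups, which is where the speedup for huge files comes from. The full rewrite of the program follows.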

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Scanner;

public class FilterUser {

    public static String UniqueUser = "D:/test/id.csv";
    public static String Raw = "D:/test/data.csv";
    public static String OutputFile = "D:/test/output.csv";

    public static void main(String[] args)
    {
        HashSet<String> InArray = new HashSet<String>();
        String line;
        long startTime = System.currentTimeMillis();

        // try-with-resources closes all three files even if an exception is thrown
        try (Scanner ScanIn1 = new Scanner(new BufferedReader(new FileReader(UniqueUser)));
             BufferedReader br = new BufferedReader(new FileReader(Raw));
             BufferedWriter bw = new BufferedWriter(new FileWriter(OutputFile))) {
            bw.write("id,date,time,Use_duration,book1,book2");
            bw.newLine();

            // Load every id from id.csv into the set once
            while (ScanIn1.hasNextLine()) {
                InArray.add(ScanIn1.nextLine());
            }
            // Single pass over data.csv with one hash lookup per row
            while ((line = br.readLine()) != null) {
                // The id is everything before the first comma, whatever its length
                int comma = line.indexOf(',');
                String id = (comma >= 0) ? line.substring(0, comma) : line;
                if (InArray.contains(id)) {
                    bw.write(line);
                    bw.newLine();
                }
            }
        }
        catch (IOException ex) {
            // Also covers FileNotFoundException, which is a subclass of IOException
            System.out.println(ex);
        }
        long endTime = System.currentTimeMillis();
        long TotalTime = endTime - startTime;
        System.out.println("Total time =" + TotalTime);
    }

}
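
As a variation, the filtering step can be pulled out into methods that work on any `Reader` and `Writer`, which makes the logic easy to try on small in-memory inputs before pointing it at the real files. This is only a sketch: the names `CsvIdFilter`, `loadIds`, and `filterById` are hypothetical, and it assumes the id is the first comma-separated field and that fields do not contain quoted commas.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of the original answer.
final class CsvIdFilter {

    // Reads one id per line into a set; surrounding whitespace is trimmed
    // and blank lines are skipped.
    static Set<String> loadIds(Reader idSource) throws IOException {
        Set<String> ids = new HashSet<>();
        BufferedReader reader = new BufferedReader(idSource);
        String line;
        while ((line = reader.readLine()) != null) {
            String id = line.trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    // Copies every row whose first field is in ids to out, unchanged.
    // Returns the number of rows written.
    static long filterById(Set<String> ids, Reader rows, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(rows);
        BufferedWriter writer = new BufferedWriter(out);
        long written = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // Assumption: the id ends at the first comma (no quoted fields).
            int comma = line.indexOf(',');
            String id = (comma >= 0 ? line.substring(0, comma) : line).trim();
            if (ids.contains(id)) {
                writer.write(line);
                writer.newLine();
                written++;
            }
        }
        writer.flush(); // flush buffered output; closing is left to the caller
        return written;
    }
}
```

For the real files, the same methods could be called with a `FileReader` for id.csv and data.csv and a `FileWriter` for output.csv, each opened in a try-with-resources block. Memory use is bounded by the number of ids, since data.csv is streamed one line at a time.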