鉴于有一些文件Customer-1.txt,Customer-2.txt和Customer-3.txt,这些文件具有以下内容:
Customer-1.txt
1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
4|2|BARBARA|JONES
Customer-2.txt
1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
4|2|BARBARA|JONES
Customer-3.txt
2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
5|2|ALEXANDER|ANDERSON
这些文件有很多重复数据,但每个文件可能包含一些唯一的数据。
鉴于实际文件已排序,大(每个文件几GB)并且文件很多......
然后是什么:
a)记忆最便宜的
b)cpu最便宜的
c)最快的
在Java中创建一个文件的方式,这三个文件将包含每个文件的所有唯一数据,如下所示排序和连接:
客户-final.txt
1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
4|2|BARBARA|JONES
5|2|ALEXANDER|ANDERSON
我查看了以下解决方案https://github.com/upcrob/spring-batch-sort-merge,但我想知道是否有可能使用FileInputStream和/或非弹簧批处理解决方案。
由于文件大小和缺少实际数据库,使用内存或真实数据库加入它们的解决方案对我的用例来说是不可行的。
答案 0 :(得分:1)
由于输入文件已经排序,因此合并其内容的文件的简单并行迭代是内存最便宜, cpu最便宜,并且最快这样做的方法。
这是一种多向合并连接,即没有“排序”的排序合并连接,可以消除重复,类似于SQL DISTINCT
。
这是一个可以执行无限数量输入文件的版本(好吧,无论如何都可以打开文件)。它使用一个辅助类来为每个输入文件分配下一行,因此每行只需要解析一次前导ID值。
private static void merge(StringWriter out, BufferedReader ... in) throws IOException {
CustomerReader[] customerReader = new CustomerReader[in.length];
for (int i = 0; i < in.length; i++)
customerReader[i] = new CustomerReader(in[i]);
merge(out, customerReader);
}
private static void merge(StringWriter out, CustomerReader ... in) throws IOException {
List<CustomerReader> min = new ArrayList<>(in.length);
for (;;) {
min.clear();
for (CustomerReader reader : in)
if (reader.hasData()) {
int cmp = (min.isEmpty() ? 0 : reader.compareTo(min.get(0)));
if (cmp < 0)
min.clear();
if (cmp <= 0)
min.add(reader);
}
if (min.isEmpty())
break; // all done
// optional: Verify that lines that compared equal by ID are entirely equal
out.write(min.get(0).getCustomerLine());
out.write(System.lineSeparator());
for (CustomerReader reader : min)
reader.readNext();
}
}
private static final class CustomerReader implements Comparable<CustomerReader> {
private BufferedReader in;
private String customerLine;
private int customerId;
CustomerReader(BufferedReader in) throws IOException {
this.in = in;
readNext();
}
void readNext() throws IOException {
if ((this.customerLine = this.in.readLine()) == null)
this.customerId = Integer.MAX_VALUE;
else
this.customerId = Integer.parseInt(this.customerLine.substring(0, this.customerLine.indexOf('|')));
}
boolean hasData() {
return (this.customerLine != null);
}
String getCustomerLine() {
return this.customerLine;
}
@Override
public int compareTo(CustomerReader that) {
// Order by customerId only. Inconsistent with equals()
return Integer.compare(this.customerId, that.customerId);
}
}
<强> TEST 强>
String file1data = "1|1|MARY|SMITH\n" +
"2|1|PATRICIA|JOHNSON\n" +
"4|2|BARBARA|JONES\n";
String file2data = "1|1|MARY|SMITH\n" +
"2|1|PATRICIA|JOHNSON\n" +
"3|1|LINDA|WILLIAMS\n" +
"4|2|BARBARA|JONES\n";
String file3data = "2|1|PATRICIA|JOHNSON\n" +
"3|1|LINDA|WILLIAMS\n" +
"5|2|ALEXANDER|ANDERSON\n";
try (
BufferedReader in1 = new BufferedReader(new StringReader(file1data));
BufferedReader in2 = new BufferedReader(new StringReader(file2data));
BufferedReader in3 = new BufferedReader(new StringReader(file3data));
StringWriter out = new StringWriter();
) {
merge(out, in1, in2, in3);
System.out.print(out);
}
<强>输出强>
1|1|MARY|SMITH
2|1|PATRICIA|JOHNSON
3|1|LINDA|WILLIAMS
4|2|BARBARA|JONES
5|2|ALEXANDER|ANDERSON
代码完全按ID值合并,并不验证行的其余部分实际上是否相等。如果需要,请在optional
评论处插入代码以检查该代码。
答案 1 :(得分:0)
这可能会有所帮助:
public static void main(String[] args) {
String files[] = {"Customer-1.txt", "Customer-2.txt", "Customer-3.txt"};
HashMap<Integer, String> customers = new HashMap<Integer, String>();
try {
String line;
for(int i = 0; i < files.length; i++) {
BufferedReader reader = new BufferedReader(new FileReader("data/" + files[i]));
while((line = reader.readLine()) != null) {
Integer uuid = Integer.valueOf(line.split("|")[0]);
customers.put(uuid, line);
}
reader.close();
}
BufferedWriter writer = new BufferedWriter(new FileWriter("data/Customer-final.txt"));
Iterator<String> it = customers.values().iterator();
while(it.hasNext()) writer.write(it.next() + "\n");
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
}
如果您有任何问题请问我。