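I wrote a program that works like an external sort. I got the idea from this blog, where they do an external sort of numbers. My requirement is slightly different. My input file may have more than a million records, which are hard to sort in memory, so I have to use the disk. I split the input into slices, sort each slice, and store it in a temporary file. The sorted outputs are then merged into a single file. So far I can split the input into temp files, but I can only merge the keys.
I have an input file as follows: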
key1 abc
key2 world
key1 hello
key3 tom
key7 yankie
key3 apple
key5 action
key7 jack
key4 apple
key2 xon
key1 lemon
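Assume the file on disk has size 10 and the memory buffer can hold at most 4 items. So I read 4 records at a time, store them in a HashMap, and sort them along with the updated counts. This input is split into 3 sorted files, shown below. As you can see, for each key I keep a count and the lexicographically highest value.
temp-file-0.txt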
key1: 2, hello
key2: 1, world
key3: 1, tom
temp-file-1.txt
key5: 1, action
key3: 1, apple
key7: 2, yankie
temp-file-2.txt
key1: 1, lemon
key2: 1, xon
key4: 1, apple
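For reference, the per-slice step described above could look roughly like the sketch below. It is not my actual code: the class name `ChunkSummary` and the method `summarize` are made up for illustration, and it uses a `TreeMap` instead of a `HashMap` so that the lines of each temp file come out in key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ChunkSummary {
    // Collapses one in-memory slice of "key value" records into
    // sorted "key: count, maxValue" lines (hypothetical helper).
    static List<String> summarize(List<String> records) {
        Map<String, Integer> counts = new TreeMap<>();    // TreeMap iterates in key order
        Map<String, String> maxValues = new TreeMap<>();
        for (String record : records) {
            String[] parts = record.split(" ", 2);
            counts.merge(parts[0], 1, Integer::sum);
            // keep whichever value is lexicographically larger
            maxValues.merge(parts[0], parts[1], (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            lines.add(e.getKey() + ": " + e.getValue() + ", " + maxValues.get(e.getKey()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // First slice of the sample input (M = 4 records)
        List<String> slice = Arrays.asList("key1 abc", "key2 world", "key1 hello", "key3 tom");
        for (String line : summarize(slice)) {
            System.out.println(line);
        }
    }
}
```

Run on the first four records, this prints the same three lines as temp-file-0.txt.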
After merging all 3 files, the output should look like this:
key1: 3 lemon
key2: 2 xon
key3: 2 tom
key4: 1 apple
key5: 1 action
key7: 2 yankie
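I am not sure about the logic for merging whole lines, including the count and the lexicographically highest value for each key. My code below only gives me all the keys, like this: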
key1
key1
key2
key2
key3
key4
key5
key3
key7
In the code below, I open each file, merge them, and write the result back to disk in a new file named external-sorted.txt:
static int N = 10; // size of the file in disk
static int M = 4; // max items the memory buffer can hold
int slices = (int) Math.ceil((double) N/M);
String tfile = "temp-file-";
//Reading all the 3 temp files
BufferedReader[] brs = new BufferedReader[slices];
String[] topNums = new String[slices];
for(i = 0; i<slices; i++){
    brs[i] = new BufferedReader(new FileReader(tfile + Integer.toString(i) + ".txt"));
    String t = brs[i].readLine();
    String[] kv = t.split(":");
    if(t!=null){
        topNums[i] = kv[0];
    }
    //topNums [key1, key5, key1]
}
FileWriter fw = new FileWriter("external-sorted.txt");
PrintWriter pw = new PrintWriter(fw);
for(i=0; i<N; i++){
    String min = topNums[0];
    System.out.println("min:"+min);
    int minFile = 0;
    for(j=0; j<slices; j++){
        if(min.compareTo(topNums[j])>0)
        {
            min = topNums[j];
            minFile = j;
        }
    }
    pw.println(min);
    String t = brs[minFile].readLine();
    String[] kv = new String[2];
    if (t != null)
        kv = t.split(":");
    topNums[minFile] = kv[0];
}
for (i = 0; i < slices; i++)
    brs[i].close();
pw.close();
fw.close();
}
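Any ideas are appreciated. If you have any questions, please ask. TIA.
Answer 0 (score: 2)
Well, something like this works. I'm sure there is a better way, but I haven't thought it through yet: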
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

// Declare Scanner Object to read our file
Scanner in = new Scanner(new File(stringRepresentingLocationOfYourFileHere));
// create Map that will contain keys in sorted order (TreeMap)
// along with last value assigned to the key
Map<String, String> mapa = new TreeMap<>();
// another map to hold keys from first map and number of
// occurrences of those keys (repetitions), this could have been
// done using single Map as well, but whatever
Map<String, Integer> mapaDva = new HashMap<>();
// String array that will hold words of each line of our .txt file
String[] line;
// we loop until we reach end of our .txt file
while(in.hasNextLine()){
    // put() returns the previous value, so a non-null result means the
    // key was already present: increment its count by 1, otherwise
    // initialize the count with 1
    if (mapa.put((line = in.nextLine().split(" "))[0], line[1]) != null)
        mapaDva.put(line[0], mapaDva.get(line[0])+1);
    else
        mapaDva.put(line[0], 1);
}
in.close();
// loop through our maps and print out keys, number of
// repetitions, last assigned value
for (Map.Entry<String, String> m : mapa.entrySet()){
    System.out.println(m.getKey() + " " + mapaDva.get(m.getKey()) + " " + m.getValue());
}
If anything specific about this code is unclear, please ask.
Sample input file:
key1 abcd
key2 zzz
key1 tommy
key3 world
Output when finished:
key1 2 tommy
key2 1 zzz
key3 1 world
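Note that the loop above keeps the last value seen for each key, which happens to match the highest one in this sample. If you want the lexicographically highest value, and a single map as hinted in the comments, something along these lines might do. It is only a sketch: `SingleMapSummary` and `Stat` are names invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SingleMapSummary {
    // Hypothetical holder for per-key state: occurrence count and highest value so far.
    static final class Stat {
        int count;
        String max;

        Stat(String first) {
            count = 1;
            max = first;
        }

        void add(String value) {
            count++;
            if (value.compareTo(max) > 0) {
                max = value;   // keep the lexicographically larger value
            }
        }
    }

    // Builds one sorted map from "key value" lines.
    static Map<String, Stat> summarize(List<String> lines) {
        Map<String, Stat> stats = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ", 2);
            Stat s = stats.get(parts[0]);
            if (s == null) {
                stats.put(parts[0], new Stat(parts[1]));
            } else {
                s.add(parts[1]);
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("key1 abcd", "key2 zzz", "key1 tommy", "key3 world");
        for (Map.Entry<String, Stat> e : summarize(lines).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().count + " " + e.getValue().max);
        }
    }
}
```

For the sample input this prints the same three lines as the output above; the two versions only differ when a later value for a key sorts lower than an earlier one.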
EDIT 2 (solution for handling multiple files):
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

// array of File objects that hold path to all your files to iterate through
File[] files = {new File("file1.txt"),
                new File("file2.txt"),
                new File("file3.txt")};
Scanner in;
Map<String, String> mapa = new TreeMap<>();
Map<String, Integer> mapaDva = new HashMap<>();
String[] line;
for (int i = 0; i < files.length; i++) {
    // assign new File to Scanner on each iteration (go through our File array)
    in = new Scanner(files[i]);
    while(in.hasNextLine()){
        if (mapa.put((line = in.nextLine().split(" "))[0], line[1]) != null)
            mapaDva.put(line[0], mapaDva.get(line[0])+1);
        else
            mapaDva.put(line[0], 1);
    }
    // release the file before moving on to the next one
    in.close();
}
for (Map.Entry<String, String> m : mapa.entrySet()){
    System.out.println(m.getKey() + " " + mapaDva.get(m.getKey()) + " " + m.getValue());
}
So we store all the File objects in a File array, iterate over each of them, merge everything, and print the final result.
3 sample input files:
file1.txt
key1 abcd
key2 zzz
key1 tommy
key3 world
file2.txt
key1 abc
key3 xxx
key1 tommy
key6 denver
file3.txt
key5 lol
key8 head
key6 tommy
key6 denver
Output:
key1 4 tommy
key2 1 zzz
key3 2 xxx
key5 1 lol
key6 3 denver
key8 1 head
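Both versions above hold every key in memory at once. If that does not fit, which is the point of an external sort, the temp files can be merged as streams with a `PriorityQueue` that holds only one line per file. The sketch below is an illustration, not tested against your real data. It assumes every temp file is sorted by key (for example, written from a `TreeMap`) and uses the `key: count, value` format from the question. `KWayMerge`, `Head` and `mergeStrings` are hypothetical names, and the demo reads from strings instead of files so it runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class KWayMerge {
    // One parsed "key: count, value" line plus the reader it came from.
    static final class Head implements Comparable<Head> {
        final String key;
        final int count;
        final String value;
        final BufferedReader source;

        Head(String line, BufferedReader source) {
            String[] kv = line.split(": ", 2);
            String[] cv = kv[1].split(", ", 2);
            this.key = kv[0];
            this.count = Integer.parseInt(cv[0]);
            this.value = cv[1];
            this.source = source;
        }

        @Override
        public int compareTo(Head other) {
            return key.compareTo(other.key);
        }
    }

    // Reads the next line of a file, if any, and puts it on the heap.
    static void push(PriorityQueue<Head> heap, BufferedReader reader) throws IOException {
        String line = reader.readLine();
        if (line != null) {
            heap.add(new Head(line, reader));
        }
    }

    // Merges sorted inputs, summing counts and keeping the highest value per key.
    static void merge(List<BufferedReader> readers, PrintWriter out) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (BufferedReader reader : readers) {
            push(heap, reader);
        }
        while (!heap.isEmpty()) {
            Head first = heap.poll();
            push(heap, first.source);
            int total = first.count;
            String max = first.value;
            // absorb the entries for the same key coming from the other files
            while (!heap.isEmpty() && heap.peek().key.equals(first.key)) {
                Head same = heap.poll();
                push(heap, same.source);
                total += same.count;
                if (same.value.compareTo(max) > 0) {
                    max = same.value;
                }
            }
            out.println(first.key + ": " + total + " " + max);
        }
    }

    // Demo helper: treats each string as the content of one temp file.
    static String mergeStrings(String... files) {
        List<BufferedReader> readers = new ArrayList<>();
        for (String content : files) {
            readers.add(new BufferedReader(new StringReader(content)));
        }
        StringWriter result = new StringWriter();
        PrintWriter out = new PrintWriter(result);
        try {
            merge(readers, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.flush();
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.print(mergeStrings(
                "key1: 2, hello\nkey2: 1, world\nkey3: 1, tom\n",
                "key3: 1, apple\nkey5: 1, action\nkey7: 2, yankie\n",
                "key1: 1, lemon\nkey2: 1, xon\nkey4: 1, apple\n"));
    }
}
```

With real temp files you would pass `new BufferedReader(new FileReader(...))` for each file and a `PrintWriter` on external-sorted.txt to `merge`. Memory use stays at one pending line per file, whatever the total number of keys.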