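My current task is to build a tool in Java that checks whether links are valid. The links come from the Jericho HTML Parser; my job is only to check whether the file exists / the link is correct. That part is done. The hard part is optimizing it, because my code is (I have to say) rather slow, at about 65 ms per call: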
public static String checkRelativeURL(String originalFileLoc, String relativeLoc) {
    StringBuilder sb = new StringBuilder();
    // Helper that converts the relative link into an absolute path
    String absolute = Common.relativeToAbsolute(originalFileLoc, relativeLoc);
    sb.append(absolute);
    sb.append("\t");
    try {
        Path path = Paths.get(absolute);
        sb.append(Files.exists(path));
    } catch (InvalidPathException | NullPointerException ex) {
        sb.append(false);
    }
    sb.append("\t");
    return sb.toString();
}
The 65 ms is spent in these lines:
Path path = Paths.get(absolute);
sb.append(Files.exists(path));
I have also tried using
File file = new File(absolute);
sb.append(file.isFile());
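but it still takes around 65~100 ms.
So is there a faster way to check whether a file exists?
I am processing more than 70k HTML files, so every millisecond counts. Thanks :(
EDIT
I tried collecting all files into a list first, but that did not really help, because listing all the files takes more than 20 minutes...
The code I use to list all files: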
static public void listFiles2(String filepath) {
    Path path = Paths.get(filepath);
    if (!path.toFile().isDirectory()) {
        return;
    }
    // try-with-resources closes the directory stream even if an exception is thrown
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(path)) {
        for (Path entry : stream) {
            File file = entry.toFile();
            String pathString = entry.toString();
            if (file.isDirectory()) {
                listFiles2(pathString);
            } else if (file.isFile()) {
                filesInProject.add(pathString);
                System.out.println(pathString);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
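Answer 0 (score: 1)
If you know the set of target operating systems in advance (which is usually the case), the fastest way to list that many files is to invoke a process through the shell, using Runtime.exec.
On Windows you can use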
dir /s /b
and on Linux
ls -R -1
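You can check what the OS is and use the appropriate command (and fail with an error, or fall back to a directory stream, if the OS is not supported).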
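The OS check could be sketched roughly as follows. The class and method names (ListingCommand, forOs) are made up for illustration, and matching on the os.name system property is only one common way to detect the platform; the commands themselves are the two given above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class ListingCommand {

    /**
     * Returns the recursive-listing command for the given OS name,
     * or null when the OS is not recognised (the caller can then
     * fail or fall back to a directory stream).
     */
    static List<String> forOs(String osName, String root) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            // /s = recurse into subdirectories, /b = bare format
            return Arrays.asList("cmd", "/c", "dir", "/s", "/b", root);
        }
        if (os.contains("linux") || os.contains("mac")) {
            // -R = recursive, -1 = one entry per line
            return Arrays.asList("ls", "-R", "-1", root);
        }
        return null;
    }

    /** Convenience overload that uses the os.name of the running JVM. */
    static List<String> forCurrentOs(String root) {
        return forOs(System.getProperty("os.name"), root);
    }
}
```

Note that dir /s /b prints full paths, while ls -R prints a header line per directory followed by bare names, so the two outputs need different parsing.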
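If you want to keep it simple and don't need to report progress, you can avoid dealing with process IO altogether and store the list in a temporary file, e.g. ls -R -1 > /tmp/filelist.txt. Alternatively, you can read directly from the process output. Read it with a buffered stream, reader or similar, with a large enough buffer.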
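Reading directly from the process output might look something like this. It is a minimal sketch: the class name ProcessListing is invented, the 1 MB buffer is an arbitrary choice, and decoding with the platform default charset is an assumption (on Windows the console code page may differ).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
class ProcessListing {

    /** Collects every non-empty line from the reader into a set. */
    static Set<String> readLines(Reader source) throws IOException {
        Set<String> lines = new HashSet<>();
        // 1 MB buffer; an arbitrary size chosen for this sketch
        try (BufferedReader reader = new BufferedReader(source, 1 << 20)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    /** Runs the command and returns the lines it printed to stdout. */
    static Set<String> run(List<String> command)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                // send stderr to the parent so a full pipe cannot block the child
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        // Assumption: the command's output uses the platform default charset
        Set<String> lines = readLines(
                new InputStreamReader(p.getInputStream(), Charset.defaultCharset()));
        if (p.waitFor() != 0) {
            throw new IOException("listing command failed: " + command);
        }
        return lines;
    }
}
```

The output is consumed before waitFor() is called, so the child process cannot stall on a full stdout pipe.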
It completes within seconds on an SSD, and within a few seconds on a modern hard drive (half a million files are not a problem for this approach).
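Once you have the list, you can handle it in different ways depending on the maximum number of files and your memory requirements. If the requirements are loose, e.g. for a desktop program, you can get away with very simple code, such as preloading the complete file list into a HashSet and checking for existence when needed. Shortening the paths by removing the common root takes less memory. You can also reduce memory by keeping only hashes of the file names instead of the full names (removing the common root first will probably reduce it even more).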
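As an illustration of the HashSet idea with the common root removed, here is a small sketch. FileIndex and its methods are hypothetical names, and the comparison is case-sensitive, which is an assumption that does not match Windows file system semantics.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of any library.
class FileIndex {
    private final String root;                 // normalised, always ends with '/'
    private final Set<String> relativePaths = new HashSet<>();

    FileIndex(String root, Iterable<String> absolutePaths) {
        String r = normalize(root);
        this.root = r.endsWith("/") ? r : r + "/";
        for (String path : absolutePaths) {
            String n = normalize(path);
            // Store only the part below the common root to save memory
            if (n.startsWith(this.root)) {
                relativePaths.add(n.substring(this.root.length()));
            }
        }
    }

    /** True if the absolute path was in the preloaded listing. */
    boolean exists(String absolutePath) {
        String n = normalize(absolutePath);
        return n.startsWith(root)
                && relativePaths.contains(n.substring(root.length()));
    }

    // Treat '\' and '/' as the same separator so both styles match
    private static String normalize(String path) {
        return path.replace('\\', '/');
    }
}
```

Each lookup is then a single in-memory hash probe rather than a file system call.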
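Or you can optimize further if you wish: the problem is now reduced to checking whether a string exists in a list of strings stored in memory or in a file, which has many well-known optimal solutions.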
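One such well-known option, shown here only as an example and not as the recommended choice, is a sorted array with binary search. It avoids the per-entry node objects of a HashSet at the cost of O(log n) lookups. SortedPathIndex is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, not part of any library.
class SortedPathIndex {
    private final String[] sorted;

    SortedPathIndex(Collection<String> paths) {
        sorted = paths.toArray(new String[0]);
        Arrays.sort(sorted);                   // natural String ordering
    }

    /** Binary search over the sorted array; O(log n) per lookup. */
    boolean contains(String path) {
        return Arrays.binarySearch(sorted, path) >= 0;
    }
}
```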
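Below is a very loose, simplistic sample for Windows. It runs dir on the root of an HDD (not SSD) drive holding ~400K files, reads the list, and benchmarks (well, sort of) time and memory for the string set and MD5 set approaches: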
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class FileListBenchmark {

    public static void main(String[] args) throws Exception {
        final Runtime rt = Runtime.getRuntime();
        System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
                / (1024 * 1024) + " Mb");
        long time = System.currentTimeMillis();
        // windows command: cd to t:\ and run recursive dir
        Process p = rt.exec("cmd /c \"t: & dir /s /b > filelist.txt\"");
        if (p.waitFor() != 0)
            throw new Exception("command has failed");
        System.out.println("done executing shell, took "
                + (System.currentTimeMillis() - time) + "ms");
        System.out.println();

        File f = new File("T:/filelist.txt");
        // load into hash set
        time = System.currentTimeMillis();
        Set<String> fileNames = new HashSet<String>(500000);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(f), StandardCharsets.UTF_8),
                50 * 1024 * 1024)) {
            for (String line = reader.readLine(); line != null; line = reader
                    .readLine()) {
                fileNames.add(line);
            }
        }
        System.out.println(fileNames.size() + " file names loaded took "
                + (System.currentTimeMillis() - time) + "ms");
        System.gc();
        System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
                / (1024 * 1024) + " Mb");

        time = System.currentTimeMillis();
        // check files
        for (int i = 0; i < 70_000; i++) {
            StringBuilder fileToCheck = new StringBuilder();
            while (fileToCheck.length() < 256)
                fileToCheck.append(Double.toString(Math.random()));
            // look up the String, not the StringBuilder (which never equals a String)
            if (fileNames.contains(fileToCheck.toString()))
                System.out.println("to prevent optimization, never executes");
        }
        System.out.println();
        System.out.println("hash set 70K checks took "
                + (System.currentTimeMillis() - time) + "ms");
        System.gc();
        System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
                / (1024 * 1024) + " Mb");

        // Test memory/performance with MD5 hash set approach instead of full
        // names. ISO-8859-1 maps every byte to exactly one char, so the digest
        // survives the round trip (UTF-8 decoding would replace invalid bytes).
        time = System.currentTimeMillis();
        Set<String> nameHashes = new HashSet<String>(50000);
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String name : fileNames) {
            String nameMd5 = new String(md5.digest(name
                    .getBytes(StandardCharsets.UTF_8)), StandardCharsets.ISO_8859_1);
            nameHashes.add(nameMd5);
        }
        System.out.println();
        System.out.println(fileNames.size() + " md5 hashes created, took "
                + (System.currentTimeMillis() - time) + "ms");
        fileNames.clear();
        fileNames = null;
        System.gc();
        Thread.sleep(100);
        System.gc();
        System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
                / (1024 * 1024) + " Mb");

        time = System.currentTimeMillis();
        // check files
        for (int i = 0; i < 70_000; i++) {
            StringBuilder fileToCheck = new StringBuilder();
            while (fileToCheck.length() < 256)
                fileToCheck.append(Double.toString(Math.random()));
            String md5ToCheck = new String(md5.digest(fileToCheck.toString()
                    .getBytes(StandardCharsets.UTF_8)), StandardCharsets.ISO_8859_1);
            if (nameHashes.contains(md5ToCheck))
                System.out.println("to prevent optimization, never executes");
        }
        System.out.println("md5 hash set 70K checks took "
                + (System.currentTimeMillis() - time) + "ms");
        System.gc();
        System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
                / (1024 * 1024) + " Mb");
    }
}
Output:
mem 3 Mb
done executing shell, took 5686ms
403108 file names loaded took 382ms
mem 117 Mb
hash set 70K checks took 283ms
mem 117 Mb
403108 md5 hashes created, took 486ms
mem 52 Mb
md5 hash set 70K checks took 366ms
mem 48 Mb