Question

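I have a 60 MB text file that my program searches for a specific ID in order to extract some associated text. I have to repeat this for more than 200 IDs. Initially I looped over the lines of the file, looked for the ID, and extracted the associated text, but that takes too long (about 2 minutes). So now I am looking for a way to load the whole file into memory and then search for my IDs and their associated text there; I assume that should be faster than going to the hard disk 200+ times. I wrote the following code to load the file into memory: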

public String createLocalFile(String path)
{   
    String text = "";
    try
    {
        FileReader fileReader = new FileReader( path );
        BufferedReader reader = new BufferedReader( fileReader );
        String currentLine = "";
        while( (currentLine = reader.readLine() ) != null )
        {
            text += currentLine;
            System.out.println( currentLine );
        }

    }
    catch(IOException ex)
    {
        System.out.println(ex.getMessage());
    }
    return text;
}

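Unfortunately, saving the file's text into a String variable takes a very long time. How can I load the file faster? Or is there a better way to accomplish the same task? Thanks for your help.

Edit: here is a link to the file: https://github.com/MVZSEQ/denovoTranscriptomeMarkerDevelopment/blob/master/Homo_sapiens.GRCh38.pep.all.fa

Typical lines look like: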

>ENSP00000471873 pep:putative chromosome:GRCh38:19:49496434:49499689:1 gene:ENSG00000142534 transcript:ENST00000594493 gene_biotype:protein_coding transcript_biotype:protein_coding
MKMQRTIVIRRDYLHYIRKYNRFEKRHKNMSVHLSPCFRDVQIGDIVTVGECRPLSKTVR
FNVLKVTKAAGTKKQFQKF

ENSP00000471873 is the ID, and the text I want to extract is

MKMQRTIVIRRDYLHYIRKYNRFEKRHKNMSVHLSPCFRDVQIGDIVTVGECRPLSKTVR
FNVLKVTKAAGTKKQFQKF

Answer 1

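1. If the file contains a set of records, create a class with id and text-content properties.
2. Read each record from the file, create an object from it, and add it to a HashMap.
3. Use the HashMap to retrieve the object by ID.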
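The three steps above can be sketched roughly as follows. This is only an illustration, not code from the answer: the names ProteinRecord, FastaIndex, parse and load are made up here, and the sketch assumes the file looks like the sample in the question (a header line starting with '>' whose first space-delimited token is the ID, followed by one or more sequence lines) and that it is readable as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Step 1: a record holding the ID and the text that belongs to it (hypothetical name). */
final class ProteinRecord {
    final String id;
    final String sequence;

    ProteinRecord(String id, String sequence) {
        this.id = id;
        this.sequence = sequence;
    }
}

final class FastaIndex {
    /** Step 2: turn the lines of the file into an ID -> record map. */
    static Map<String, ProteinRecord> parse(List<String> lines) {
        Map<String, ProteinRecord> records = new HashMap<>();
        String currentId = null;
        StringBuilder sequence = new StringBuilder();
        for (String line : lines) {
            if (line.startsWith(">")) {
                // A header starts a new record, so store the previous one first.
                if (currentId != null) {
                    records.put(currentId, new ProteinRecord(currentId, sequence.toString()));
                }
                // Assumption: the ID is the text between '>' and the first space.
                int space = line.indexOf(' ');
                currentId = (space < 0) ? line.substring(1) : line.substring(1, space);
                sequence.setLength(0);
            } else if (currentId != null) {
                // Sequence lines are joined without line breaks.
                sequence.append(line.trim());
            }
        }
        if (currentId != null) {
            records.put(currentId, new ProteinRecord(currentId, sequence.toString()));
        }
        return records;
    }

    /** Reads the whole file once (UTF-8 is an assumption) and indexes it. */
    static Map<String, ProteinRecord> load(Path path) throws IOException {
        return parse(Files.readAllLines(path, StandardCharsets.UTF_8));
    }
}
```

Step 3 is then a plain map lookup per ID, for example FastaIndex.load(path).get("ENSP00000471873").sequence, so the file is read once no matter how many IDs are looked up.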

Answer 2

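I agree with most of the other comments. 60 MB is not too large for today's memory. But the place where the time goes is almost certainly the "+=" that appends every line onto a single, ever-growing String, because each append copies everything accumulated so far. Build a list (or array) of lines instead.

Better yet, separate the ID from the "associated text" as you read, so that subsequent ID lookups are fast. A hash table is ideal.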
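A minimal sketch of the "list of lines" idea, assuming Java 8 or later. The class and method names (LineLoader, readLines, loadFile) are invented for illustration, and UTF-8 is an assumption about the file's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

final class LineLoader {
    /** Collects every line into a list; no String is ever re-copied as the input grows. */
    static List<String> readLines(Reader source) {
        return new BufferedReader(source).lines().collect(Collectors.toList());
    }

    /** Opens the file, reads all of its lines, and closes it again. */
    static List<String> loadFile(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return readLines(reader);
        }
    }
}
```

Each line is stored once, so the work grows linearly with the file size, whereas repeated "+=" on one String grows roughly quadratically.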

Answer 3

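You are definitely on the right track in thinking that you should read the file into memory and access it through some kind of map. That removes much of the bottleneck, namely disk I/O and access time (memory is faster).

I would suggest reading the data into a HashMap, with the ID as the key and the text as the value.

Try something like: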

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public Map<Integer, String> getIdMap(final String pathToFile) throws IOException {
    // we'll use this later to store our mappings
    final Map<Integer, String> map = new HashMap<Integer, String>();
    // read the file into a String
    final String rawFileContents = new String(Files.readAllBytes(Paths.get(pathToFile)));
    // assumes each line is an ID + value; "\\R" matches any line terminator
    final String[] fileLines = rawFileContents.split("\\R");
    // iterate over every line, and create a mapping for the ID to Value
    for (final String line : fileLines) {
        // assumes a 2 part line in CSV "," format: id,value
        final String[] parts = line.split(",", 2);
        if (parts.length < 2) {
            continue; // skip lines that have no value part
        }
        try {
            // put the pair into our map
            map.put(Integer.parseInt(parts[0].trim()), parts[1]);
        } catch (NumberFormatException e) {
            // skip lines whose ID is not a number instead of storing a null key
            e.printStackTrace();
        }
    }
    return map;
}

This reads the file into memory (as a String) and then splits it into a Map, so that values can be retrieved like this:

Map<Integer, String> map = getIdMap("/path/to/file");
final String theText = map.get(theId);
System.out.println(theText);

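This sample code is untested and makes some assumptions about your file format, namely that there is one ID and value per line and that the ID and value are comma-separated (CSV). Of course, if your data is structured slightly differently, just adjust it accordingly.

Updated to match your file description: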

public Map<String, String> getIdMap(final String pathToFile) throws IOException {
    // we'll use this later to store our mappings
    final Map<String, String> map = new HashMap<String, String>();
    // read the file into a String
    final String rawFileContents = new String(Files.readAllBytes(Paths.get(pathToFile)));
    // records span several lines; only header lines (starting with '>') carry the ID
    final String[] fileLines = rawFileContents.split("\\R");
    // iterate over every line, and create a mapping for the ID to Value
    for (final String line : fileLines) {
        if (!line.startsWith(">")) {
            continue; // sequence lines contain neither an ID nor a biotype
        }
        // get the id and remove the leading '>' symbol
        final String id = line.split(" ")[0].replace(">", "").trim();
        // use the key 'transcript_biotype:' to get the value after it (e.g. 'IG_D_gene')
        final String[] parts = line.split("transcript_biotype:");
        if (parts.length < 2) {
            continue; // header without a transcript_biotype field
        }
        // put the pair into our map
        map.put(id, parts[1].trim());
    }
    return map;
}

Answer 4

Assuming your VM has been given enough heap, you can load the raw file into memory like this:

public byte[] loadFile(File f) throws IOException {
    long size = f.length();

    if (size > Integer.MAX_VALUE) {
        throw new IllegalArgumentException("file too long");
    }
    byte[] bytes = new byte[(int) size];

    // try-with-resources closes the stream even if reading fails
    try (InputStream source = new FileInputStream(f)) {
        int nread;
        for (int next = 0; next < bytes.length; next += nread) {
            nread = source.read(bytes, next, bytes.length - next);
            if (nread < 0) {
                // the file shrank while we were reading it
                throw new IOException("file truncated while reading it");
            }
        }
        if (source.read() != -1) {
            // the file grew while we were reading it
            throw new IOException("file extended while reading it");
        }
    }

    return bytes;
}

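You can then work on the in-memory copy instead of reading from disk by wrapping a ByteArrayInputStream around it; you should be able to plug that into your existing code relatively easily.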
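As a rough sketch of that wrapping step: the helper below is hypothetical (InMemoryReader, open and containsId are names invented here), and it assumes the bytes are ASCII text, which matches the sample lines in the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class InMemoryReader {
    /** Wraps the in-memory bytes so they can be read line by line, like a file. */
    static BufferedReader open(byte[] fileBytes) {
        return new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), StandardCharsets.US_ASCII));
    }

    /** Scans the in-memory copy for a header line that begins with ">" + id. */
    static boolean containsId(byte[] fileBytes, String id) {
        String prefix = ">" + id;
        return open(fileBytes).lines()
                .anyMatch(line -> line.equals(prefix) || line.startsWith(prefix + " "));
    }
}
```

Every search opens a fresh reader over the same byte[], so the disk is touched only once, when the array is filled. Each lookup still scans the data linearly, though.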

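There are probably other ways to optimize further. For example, if processing the data necessarily involves decoding it into characters, you could cache the decoded result by using a Reader to read into a char[] instead of an InputStream to read into a byte[], and then proceed similarly. Note, however, that storing ASCII data as chars takes twice the space of storing it as bytes.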

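If the data are suitable, it may be worthwhile to parse them fully into some more sophisticated data structure (such as a Map), which could make subsequent lookups very fast. The price, of course, is more memory usage.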

Answer 5

I think your problem comes from appending Strings onto text. You should use a StringBuffer instead. I would also suggest using the Scanner class instead of FileReader:

public String createLocalFile(String path)
{   
    StringBuffer text = new StringBuffer();
    try
    {
        Scanner sc = new Scanner( new File(path) );
        while( sc.hasNextLine() )
        {
            String currentLine = sc.nextLine();
            text.append(currentLine);
            System.out.println( currentLine );
        }

    }
    catch(IOException ex)
    {
        System.out.println(ex.getMessage());
    }
    return text.toString();
}

That should be much faster.
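For single-threaded code like this, StringBuilder offers the same append API as StringBuffer without the synchronization. A small sketch of that variant follows; TextLoader and readAll are hypothetical names, and re-adding a '\n' after each line is a choice made here so that line boundaries survive, which the code above does not do.

```java
import java.io.BufferedReader;
import java.io.Reader;

final class TextLoader {
    /** Reads everything from the source into one String, one '\n' per line. */
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder();
        // append() grows the internal buffer in place instead of copying the whole text each time
        new BufferedReader(source).lines().forEach(line -> text.append(line).append('\n'));
        return text.toString();
    }
}
```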

Answer 6

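What you are working with is a FASTA file. Try BioPerl ... there are plenty of libraries for parsing and working with these kinds of files. Whatever you are doing, it has most likely already been done ....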

Loading a file into memory (Java)?
