What is an efficient way to transpose a matrix in a text file?

Asked: 2012-03-20 07:37:42

Tags: java matrix transpose

I have a text file containing a two-dimensional matrix. It looks like this:

01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20

As you can see, rows are separated by newlines and columns by spaces. I need to transpose this matrix in an efficient way, producing:

01 06 11 16
02 07 12 17
03 08 13 18
04 09 14 19
05 10 15 20

In reality, the matrix is 10,000 by 14,000, and each element is a double/float. Trying to transpose this file/matrix entirely in memory would be expensive, if not impossible.

Does anyone know of a utility API for this sort of thing, or an efficient approach?

What I have tried: my naive approach is to create one temporary file per column (i.e., per row of the transposed matrix). So, with 10,000 of them, I would have 10,000 temporary files. As I read each row, I tokenize every value and append it to the corresponding file. For the example above, I would end up with:

file-0: 01 06 11 16
file-1: 02 07 12 17
file-2: 03 08 13 18
file-3: 04 09 14 19
file-4: 05 10 15 20
Then I read each file back and append them all into one file. I am wondering if there is a smarter way, because I know the file I/O operations will be a pain point.

3 Answers:

Answer 0 (score: 1):

A solution with minimal memory consumption and extremely low performance:

import org.apache.commons.io.FileUtils;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class MatrixTransposer {

  private static final String TMP_DIR = System.getProperty("java.io.tmpdir") + "/";
  private static final String EXTENSION = ".matrix.tmp.result";
  private final String original;
  private final String dst;

  public MatrixTransposer(String original, String dst) {
    this.original = original;
    this.dst = dst;
  }

  public void transpose() throws IOException {

    deleteTempFiles();

    int max = 0;

    FileReader fileReader = null;
    BufferedReader reader = null;
    try {
      fileReader = new FileReader(original);
      reader = new BufferedReader(fileReader);
      String row;
      while((row = reader.readLine()) != null) {

        max = appendRow(max, row, 0);
      }
    } finally {
      if (null != reader) reader.close();
      if (null != fileReader) fileReader.close();
    }


    mergeResultingRows(max);
  }

  private void deleteTempFiles() {
    for (String tmp : new File(TMP_DIR).list()) {
      if (tmp.endsWith(EXTENSION)) {
        FileUtils.deleteQuietly(new File(TMP_DIR + "/" + tmp));
      }
    }
  }

  private void mergeResultingRows(int max) throws IOException {

    FileUtils.deleteQuietly(new File(dst));

    FileWriter writer = null;
    BufferedWriter out = null;

    try {
      writer = new FileWriter(new File(dst), true);
      out = new BufferedWriter(writer);
      for (int i = 0; i <= max; i++) {
        out.write(FileUtils.readFileToString(new File(TMP_DIR + i + EXTENSION)) + "\r\n");
      }
    } finally {
      if (null != out) out.close();
      if (null != writer) writer.close();
    }
  }

  private int appendRow(int max, String row, int i) throws IOException {

    for (String element : row.split(" ")) {

      FileWriter writer = null;
      BufferedWriter out = null;
      try {
        writer = new FileWriter(TMP_DIR + i + EXTENSION, true);
        out = new BufferedWriter(writer);
        out.write(columnPrefix(i) + element);
      } finally {
        if (null != out) out.close();
        if (null != writer) writer.close();
      }
      max = Math.max(i++, max);
    }
    return max;
  }

  private String columnPrefix(int i) {

    return (0 == i ? "" : " ");
  }

  public static void main(String[] args) throws IOException {

    new MatrixTransposer("c:/temp/mt/original.txt", "c:/temp/mt/transposed.txt").transpose();
  }
}

Answer 1 (score: 0):

The total size is 1.12 GB if the elements are doubles, half that if floats. That is small enough for today's machines that you could do it in memory. However, you may want to perform the transposition in place, which is a non-trivial task; the Wikipedia article on in-place matrix transposition provides further links.
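A minimal sketch of that in-memory approach, shown on the question's sample matrix (in a real run, the rows would come from reading the file line by line and splitting on spaces, and the result would be written back out; those I/O steps are elided here):

```java
public class InMemoryTranspose {

    // Transpose a rectangular matrix of string tokens held entirely in memory.
    static String[][] transpose(String[][] m) {
        String[][] t = new String[m[0].length][m.length];
        for (int r = 0; r < m.length; r++)
            for (int c = 0; c < m[r].length; c++)
                t[c][r] = m[r][c];
        return t;
    }

    public static void main(String[] args) {
        // The question's 4x5 sample matrix.
        String[][] m = {
            {"01", "02", "03", "04", "05"},
            {"06", "07", "08", "09", "10"},
            {"11", "12", "13", "14", "15"},
            {"16", "17", "18", "19", "20"},
        };
        for (String[] row : transpose(m)) {
            System.out.println(String.join(" ", row)); // first line: 01 06 11 16
        }
    }
}
```

For a 10,000 x 14,000 matrix this holds every token in RAM at once, which is exactly the cost this answer argues is acceptable on current hardware.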

Answer 2 (score: 0):

I suggest you estimate how many columns you can read without consuming too much memory, and then write the final file by reading the source file several times, one chunk of columns at a time. Say you have 10,000 columns: first read columns 0 to 250 of the source file into collections and write them to the final file, then do the same for columns 250 to 500, and so on.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TransposeMatrixUtils {

    private static final Logger logger = LoggerFactory.getLogger(TransposeMatrixUtils.class);

    // Max number of bytes of the src file involved in each chunk
    public static int MAX_BYTES_PER_CHUNK = 1024 * 50_000;// 50 MB

    public static File transposeMatrix(File srcFile, String separator) throws IOException {
        File output = File.createTempFile("output", ".txt");
        transposeMatrix(srcFile, output, separator);
        return output;
    }

    public static void transposeMatrix(File srcFile, File destFile, String separator) throws IOException {
        long bytesPerColumn = assessBytesPerColumn(srcFile, separator);// rough assessment of bytes par column
        int nbColsPerChunk = (int) (MAX_BYTES_PER_CHUNK / bytesPerColumn);// number of columns per chunk according to the limit of bytes to be used per chunk
        if (nbColsPerChunk == 0) nbColsPerChunk = 1;// in case a single column has more bytes than the limit ...
        logger.debug("file length : {} bytes. max bytes per chunk : {}. nb columns per chunk : {}.", srcFile.length(), MAX_BYTES_PER_CHUNK, nbColsPerChunk);
        try (FileWriter fw = new FileWriter(destFile); BufferedWriter bw = new BufferedWriter(fw)) {
            boolean remainingColumns = true;
            int offset = 0;
            while (remainingColumns) {
                remainingColumns = writeColumnsInRows(srcFile, bw, separator, offset, nbColsPerChunk);
                offset += nbColsPerChunk;
            }
        }
    }

    private static boolean writeColumnsInRows(File srcFile, BufferedWriter bw, String separator, int offset, int nbColumns) throws IOException {
        List<String>[] newRows;
        boolean remainingColumns = true;
        try (FileReader fr = new FileReader(srcFile); BufferedReader br = new BufferedReader(fr)) {
            String[] split0 = br.readLine().split(separator);
            if (split0.length <= offset + nbColumns) remainingColumns = false;
            int lastColumnIndex = Math.min(split0.length, offset + nbColumns);
            logger.debug("chunk for column {} to {} among {}", offset, lastColumnIndex, split0.length);
            newRows = new List[lastColumnIndex - offset];
            for (int i = 0; i < newRows.length; i++) {
                newRows[i] = new ArrayList<>();
                newRows[i].add(split0[i + offset]);
            }
            String line;
            while ((line = br.readLine()) != null) {
                String[] split = line.split(separator);
                for (int i = 0; i < newRows.length; i++) {
                    newRows[i].add(split[i + offset]);
                }
            }
        }
        for (int i = 0; i < newRows.length; i++) {
            bw.write(newRows[i].get(0));
            for (int j = 1; j < newRows[i].size(); j++) {
                bw.write(separator);
                bw.write(newRows[i].get(j));
            }
            bw.newLine();
        }
        return remainingColumns;
    }

    private static long assessBytesPerColumn(File file, String separator) throws IOException {
        try (FileReader fr = new FileReader(file); BufferedReader br = new BufferedReader(fr)) {
            int nbColumns = br.readLine().split(separator).length;
            return file.length() / nbColumns;
        }
    }

}

It should be more efficient than creating a huge number of temporary files, which would generate a lot of I/O.

For the example 10,000 x 14,000 matrix, this code takes 3 minutes to create the transposed file. If you set MAX_BYTES_PER_CHUNK = 1024 * 100_000 instead of 1024 * 50_000, it takes 2 minutes, but of course uses more memory.
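To get a feel for the numbers, here is a back-of-the-envelope sketch of the chunk sizing above. The 1.12 GB file size is an assumption borrowed from Answer 1, and 14,000 columns per row is taken from the question; both are illustrative, not measured:

```java
public class ChunkMath {
    public static void main(String[] args) {
        long fileLength = 1_120_000_000L;     // assumed file size (from Answer 1's estimate)
        int totalColumns = 14_000;            // columns per row, per the question
        int maxBytesPerChunk = 1024 * 50_000; // 50 MB, as in the code above

        long bytesPerColumn = fileLength / totalColumns;               // 80,000 bytes
        int colsPerChunk = (int) (maxBytesPerChunk / bytesPerColumn);  // 640 columns
        int passes = (totalColumns + colsPerChunk - 1) / colsPerChunk; // 22 full reads of the source

        System.out.println(bytesPerColumn + " " + colsPerChunk + " " + passes);
        // prints: 80000 640 22
    }
}
```

Doubling MAX_BYTES_PER_CHUNK halves the number of passes over the source file, which is consistent with the 3-minute vs. 2-minute observation above.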