I have a text file containing a two-dimensional matrix. It looks like the following.
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
As you can see, rows are separated by newlines and columns by spaces. I need to transpose this matrix efficiently, like so:
01 06 11 16
02 07 12 17
03 08 13 18
04 09 14 19
05 10 15 20
In reality, the matrix is 10,000 by 14,000, and each element is a double/float. Trying to transpose this file/matrix entirely in memory would be expensive, if not impossible.
Does anyone know of a utility API for something like this, or an efficient approach?
What I have tried: my naive approach is to create a temporary file for each column (i.e. each row of the transposed matrix). So, with 10,000 of them, I would have 10,000 temporary files. As I read each row, I tokenize each value and append it to the corresponding file. For the example above, I would have something like the following.
file-0: 01 06 11 16
file-1: 02 07 12 17
file-2: 03 08 13 18
file-3: 04 09 14 19
file-4: 05 10 15 20
Then I read each file back and append them into a single file. I wonder if there is a smarter way, because I know the file I/O operations will be the pain point.
Answer 0 (score: 1)
A solution with minimal memory consumption and extremely low performance:
import org.apache.commons.io.FileUtils;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class MatrixTransposer {

    private static final String TMP_DIR = System.getProperty("java.io.tmpdir") + "/";
    private static final String EXTENSION = ".matrix.tmp.result";

    private final String original;
    private final String dst;

    public MatrixTransposer(String original, String dst) {
        this.original = original;
        this.dst = dst;
    }

    public void transpose() throws IOException {
        deleteTempFiles();
        int max = 0;
        FileReader fileReader = null;
        BufferedReader reader = null;
        try {
            fileReader = new FileReader(original);
            reader = new BufferedReader(fileReader);
            String row;
            while ((row = reader.readLine()) != null) {
                max = appendRow(max, row, 0);
            }
        } finally {
            if (null != reader) reader.close();
            if (null != fileReader) fileReader.close();
        }
        mergeResultingRows(max);
    }

    private void deleteTempFiles() {
        for (String tmp : new File(TMP_DIR).list()) {
            if (tmp.endsWith(EXTENSION)) {
                FileUtils.deleteQuietly(new File(TMP_DIR + tmp));
            }
        }
    }

    private void mergeResultingRows(int max) throws IOException {
        FileUtils.deleteQuietly(new File(dst));
        FileWriter writer = null;
        BufferedWriter out = null;
        try {
            writer = new FileWriter(new File(dst), true);
            out = new BufferedWriter(writer);
            for (int i = 0; i <= max; i++) {
                out.write(FileUtils.readFileToString(new File(TMP_DIR + i + EXTENSION)) + "\r\n");
            }
        } finally {
            if (null != out) out.close();
            if (null != writer) writer.close();
        }
    }

    // Appends each value of the row to the temp file of its column.
    // Opening and closing a file per value is what makes this extremely slow.
    private int appendRow(int max, String row, int i) throws IOException {
        for (String element : row.split(" ")) {
            File tmpFile = new File(TMP_DIR + i + EXTENSION);
            String prefix = columnPrefix(tmpFile);
            FileWriter writer = null;
            BufferedWriter out = null;
            try {
                writer = new FileWriter(tmpFile, true);
                out = new BufferedWriter(writer);
                out.write(prefix + element);
            } finally {
                if (null != out) out.close();
                if (null != writer) writer.close();
            }
            max = Math.max(i++, max);
        }
        return max;
    }

    // A separator is needed before every value except the first one written to the file.
    private String columnPrefix(File tmpFile) {
        return (0 == tmpFile.length() ? "" : " ");
    }

    public static void main(String[] args) throws IOException {
        new MatrixTransposer("c:/temp/mt/original.txt", "c:/temp/mt/transposed.txt").transpose();
    }
}
Answer 1 (score: 0)
The total size is 1.12 GB for doubles, half that for floats. That is small enough for today's machines that you could do it all in memory. However, you may want to do the transposition in place, which is a non-trivial task. The wikipedia article on in-place matrix transposition provides further links.
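As a rough sketch of what the all-in-memory route could look like (a minimal illustration, assuming the matrix fits in RAM and values are space-separated; the class and method names are my own, not from the original answer):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class InMemoryTranspose {

    // Reads the whole matrix into memory, then writes it out column by column.
    public static void transpose(Path src, Path dst) throws IOException {
        List<String> lines = Files.readAllLines(src);
        String[][] m = new String[lines.size()][];
        for (int r = 0; r < lines.size(); r++) {
            m[r] = lines.get(r).trim().split(" ");
        }
        try (BufferedWriter out = Files.newBufferedWriter(dst)) {
            // Each column of the source becomes one row of the destination.
            for (int c = 0; c < m[0].length; c++) {
                StringBuilder sb = new StringBuilder();
                for (int r = 0; r < m.length; r++) {
                    if (r > 0) sb.append(' ');
                    sb.append(m[r][c]);
                }
                out.write(sb.toString());
                out.newLine();
            }
        }
    }
}
```

Note that holding the values as `String` objects costs far more than the 1.12 GB of raw doubles; parsing into a `double[][]` would be closer to that figure.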
Answer 2 (score: 0)
I suggest you assess how many columns you can read without consuming too much memory, then write the final file by reading the source file several times, each pass handling a chunk of that many columns. Suppose you have 10,000 columns: first read columns 0 to 250 of the source file into collections and write them to the final file, then do the same for columns 250 to 500, and so on.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TransposeMatrixUtils {

    private static final Logger logger = LoggerFactory.getLogger(TransposeMatrixUtils.class);

    // Max number of bytes of the src file involved in each chunk
    public static int MAX_BYTES_PER_CHUNK = 1024 * 50_000; // 50 MB

    public static File transposeMatrix(File srcFile, String separator) throws IOException {
        File output = File.createTempFile("output", ".txt");
        transposeMatrix(srcFile, output, separator);
        return output;
    }

    public static void transposeMatrix(File srcFile, File destFile, String separator) throws IOException {
        long bytesPerColumn = assessBytesPerColumn(srcFile, separator); // rough assessment of bytes per column
        // number of columns per chunk according to the limit of bytes to be used per chunk
        int nbColsPerChunk = (int) (MAX_BYTES_PER_CHUNK / bytesPerColumn);
        if (nbColsPerChunk == 0) nbColsPerChunk = 1; // in case a single column has more bytes than the limit
        logger.debug("file length : {} bytes. max bytes per chunk : {}. nb columns per chunk : {}.", srcFile.length(), MAX_BYTES_PER_CHUNK, nbColsPerChunk);
        try (FileWriter fw = new FileWriter(destFile); BufferedWriter bw = new BufferedWriter(fw)) {
            boolean remainingColumns = true;
            int offset = 0;
            while (remainingColumns) {
                remainingColumns = writeColumnsInRows(srcFile, bw, separator, offset, nbColsPerChunk);
                offset += nbColsPerChunk;
            }
        }
    }

    private static boolean writeColumnsInRows(File srcFile, BufferedWriter bw, String separator, int offset, int nbColumns) throws IOException {
        List<String>[] newRows;
        boolean remainingColumns = true;
        try (FileReader fr = new FileReader(srcFile); BufferedReader br = new BufferedReader(fr)) {
            String[] split0 = br.readLine().split(separator);
            if (split0.length <= offset + nbColumns) remainingColumns = false;
            int lastColumnIndex = Math.min(split0.length, offset + nbColumns);
            logger.debug("chunk for column {} to {} among {}", offset, lastColumnIndex, split0.length);
            newRows = new List[lastColumnIndex - offset];
            for (int i = 0; i < newRows.length; i++) {
                newRows[i] = new ArrayList<>();
                newRows[i].add(split0[i + offset]);
            }
            String line;
            while ((line = br.readLine()) != null) {
                String[] split = line.split(separator);
                for (int i = 0; i < newRows.length; i++) {
                    newRows[i].add(split[i + offset]);
                }
            }
        }
        for (int i = 0; i < newRows.length; i++) {
            bw.write(newRows[i].get(0));
            for (int j = 1; j < newRows[i].size(); j++) {
                bw.write(separator);
                bw.write(newRows[i].get(j));
            }
            bw.newLine();
        }
        return remainingColumns;
    }

    private static long assessBytesPerColumn(File file, String separator) throws IOException {
        try (FileReader fr = new FileReader(file); BufferedReader br = new BufferedReader(fr)) {
            int nbColumns = br.readLine().split(separator).length;
            return file.length() / nbColumns;
        }
    }
}
It should be more efficient than creating a huge number of temporary files, which would generate a lot of I/O.
For the 10,000 x 14,000 matrix example, this code takes 3 minutes to create the transposed file. If you set MAX_BYTES_PER_CHUNK = 1024 * 100_000 instead of 1024 * 50_000, it takes 2 minutes, but of course consumes more memory.
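To make the multi-pass idea concrete, here is a stripped-down, self-contained sketch of the same technique (the class name and the fixed columns-per-pass parameter are illustrative assumptions; TransposeMatrixUtils above is the full version with automatic chunk sizing and logging):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ChunkedTranspose {

    // Transposes by re-reading the source once per chunk of columns,
    // so only colsPerChunk columns are held in memory at a time.
    public static void transpose(Path src, Path dst, int colsPerChunk) throws IOException {
        int totalCols;
        try (BufferedReader br = Files.newBufferedReader(src)) {
            totalCols = br.readLine().trim().split(" ").length;
        }
        try (BufferedWriter bw = Files.newBufferedWriter(dst)) {
            for (int offset = 0; offset < totalCols; offset += colsPerChunk) {
                int end = Math.min(totalCols, offset + colsPerChunk);
                // One StringBuilder per output row (i.e. per source column) in this chunk.
                List<StringBuilder> rows = new ArrayList<>();
                for (int c = offset; c < end; c++) rows.add(new StringBuilder());
                try (BufferedReader br = Files.newBufferedReader(src)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        String[] split = line.trim().split(" ");
                        for (int c = offset; c < end; c++) {
                            StringBuilder sb = rows.get(c - offset);
                            if (sb.length() > 0) sb.append(' ');
                            sb.append(split[c]);
                        }
                    }
                }
                for (StringBuilder sb : rows) {
                    bw.write(sb.toString());
                    bw.newLine();
                }
            }
        }
    }
}
```

The trade-off is the same as in the answer above: a larger colsPerChunk means fewer passes over the source file but more memory per pass.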