Question

I have a list (in a tab-delimited .txt file) like this:

row   col   value
1     1     3.2
10    2     5.3
25    3     2.2
30    1     5.3

and so on.

I want to turn it into a sparse matrix like this:

    1    2    3
1   3.2  
10       5.3 
25            2.2
30  5.3

and then fill the empty cells with zeros.

What is the simplest way to do this with Hadoop? (I need to use Hadoop because the matrix is about 3 TB...)

Answer 1

You can use Hive or Pig. Here is an example using Pig:

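A = LOAD 'input.txt' USING PigStorage('\t') AS (row:long, col:int, value:float);
-- Group by row so the UDF receives every (row, col, value) entry of one matrix line as a bag.
G = GROUP A BY row;
B = FOREACH G GENERATE SOMEUDF(A);
STORE B INTO 'output.txt';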

Then you just need to define a UDF:

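import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class SOMEUDF extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            // Generate the matrix line here and return it.
            // Placeholder: echo the input until the real logic is written.
            return input;
        } catch (Exception e) {
            // IOException accepts a cause directly, so no wrapper class is needed.
            throw new IOException("Caught exception processing input row", e);
        }
    }
}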
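
As a rough illustration of what the "generate the matrix line" step could look like, here is a minimal sketch in plain Java with no Pig dependency. It assumes the entries of one row have already been collected (for example by grouping on row), that column indices are 1-based as in the sample input, and that the total number of columns is known up front. The class name DenseRow and the methods fill and format are hypothetical names used only for this example, not part of Pig or Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DenseRow {
    /**
     * Expands the sparse entries of one matrix row into a dense array.
     * Column indices are assumed to be 1-based; columns without an entry stay 0.0.
     */
    public static double[] fill(Map<Integer, Double> entries, int numCols) {
        double[] dense = new double[numCols]; // Java zero-initializes new arrays
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            int col = e.getKey();
            if (col < 1 || col > numCols) {
                throw new IllegalArgumentException("column out of range: " + col);
            }
            dense[col - 1] = e.getValue();
        }
        return dense;
    }

    /** Formats a row id and its dense values as one tab-separated output line. */
    public static String format(long row, double[] dense) {
        StringBuilder sb = new StringBuilder(Long.toString(row));
        for (double v : dense) {
            sb.append('\t').append(v);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Row 10 from the sample input has a single entry: column 2 = 5.3.
        Map<Integer, Double> row10 = new LinkedHashMap<>();
        row10.put(2, 5.3);
        System.out.println(format(10, fill(row10, 3)));
    }
}
```

Inside the UDF, the same logic would read the (col, value) pairs out of the input bag instead of a Map. Note that this only fills zeros within rows that appear in the input; rows with no entries at all (2 through 9 in the sample) would need separate handling.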
