Sum the values of every n rows in Pig

Date: 2017-06-01 10:20:10

Tags: java hadoop apache-pig bigdata

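I have colon-delimited data and I am trying to add up the last-column values of every 2 rows, like this:

0.20+0.50 = 0.70
0.90+0.10 = 1.0

The result should be printed with the sum appended to each row, like this: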

 1:23:0.20:0.70
 2:34:0.50:0.70
 3:67:0.90:1.0
 4:87:0.10:1.0
 5:23:0.12

This is my Pig script:

data = LOAD '/home/user/Documents/test/test.txt' USING PigStorage(':') AS (tag:int, rssi:chararray, weightage:chararray, seqnum:int);
B = FOREACH (GROUP data ALL) {
    A_ordered = ORDER data BY rssi;
    GENERATE FLATTEN(CUSTOM_UDF(A_ordered));
}

This is the Java UDF I tried:

public List<String> sumValues() {
    List<String> processedList = new ArrayList<>();
    if (entries == null) {
        return processedList;
    } else {
        double columnSum = 0;
        List<String> tempList = new ArrayList<>(); 
        int length = entries.size();
        for (int index = 1; index <= length; index++) {
            tempList.add(entries.get(index - 1)); 
            String[] splitValues = entries.get(index - 1).split(DELIMITER);
            if (splitValues.length >= MIN_SPLIT_STRING_LENGTH) {

                try {
                    double lastValue = Double.parseDouble(splitValues[WEIGHTAGE_INDEX]);
                    columnSum = columnSum + lastValue;

                    if ((index % ROWS_TO_BE_SUMMED == 0) || (index == length)) {
                        for (String tempString : tempList) {
                            processedList.add(tempString + ":" + columnSum);
                        }
                        tempList.clear(); // Clear the temporary array
                        columnSum = 0;
                    }
                } catch (NumberFormatException e) {
                    System.out.println("Invalid weightage");
                }
            } else {
                System.out.println("Invalid input");
            }
        }
    }
    return processedList;
}


@Override
public String exec(Tuple input) throws IOException {
    System.out.println("------INSIDE EXEC FUNCTION ----" + input);
    if (input != null && input.size() != 0) {
        try {
            String str = (String) input.get(0);
            if (str != null) {
                String splitStrings[] = str.split(":");
                if (splitStrings != null && splitStrings.length >= 3 && splitStrings[2].equals(EXIT)) {
                    List<String> processedList = sumValues();
                    String sum = processedList.toString();
                    System.out.println("SUM VALUE----:" + sum);
                    return sum;
                } else {
                    System.out.println("INPUT VALUE----:" + str);
                    entries.add(str);
                    return null;
                }
            }
        } catch (Exception e) {
            return null;
        }
    }
    return null;
}
}

The Java UDF does not work correctly.

The code above prints an empty result. Any help would be greatly appreciated.

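2 Answers:

Answer 0 (score: 2)

This can be done in Pig. Generate another column, say f11, that holds the row id for odd-numbered rows and the row id minus 1 for even-numbered rows, so that each pair of consecutive rows shares the same id. You can then group the two records by the new column and sum the last column. Finally, join the summed relation back to the original one and select the required columns.

Note: to sum over n rows, derive the key from f1 % n_value instead, for example f1 - ((f1 - 1) % n_value), so that rows 1 to n share key 1, rows n+1 to 2n share key n+1, and so on.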

A = LOAD 'input.txt' USING PigStorage(':') AS (f1:int,f2:int,f3:double);
-- f11 is the group key: even rows take the id of the preceding odd row.
B = FOREACH A GENERATE f1,(f1%2 == 0 ? (f1-1):f1) AS f11,f2,f3;
C = GROUP B BY f11;
-- Inside the grouped relation, f3 is reached through the bag B.
D = FOREACH C GENERATE group AS f11,SUM(B.f3) AS Total;
E = JOIN B BY f11,D BY f11;
-- After a JOIN, fields are disambiguated with '::'.
F = FOREACH E GENERATE B::f1,B::f2,B::f3,D::Total;
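The grouping key is the part that changes when summing n rows instead of 2. Below is a minimal Java sketch of that key, assuming rows are numbered 1, 2, 3, ... without gaps. The class `GroupKey` and the method `groupKey` are hypothetical names used only for illustration, not Pig functions. For n = 2 the formula gives the same values as the f11 column; the generalisation to other n is one possible choice, not something the answer spells out.

```java
// Hypothetical helper for illustration only; it is not part of Pig.
class GroupKey {
    // Maps a 1-based row id to the id of the first row in its block of n rows.
    // Assumption: row ids are consecutive and start at 1.
    static int groupKey(int rowId, int n) {
        return rowId - ((rowId - 1) % n);
    }

    public static void main(String[] args) {
        // Print the key for blocks of 2 rows and blocks of 3 rows.
        for (int rowId = 1; rowId <= 7; rowId++) {
            System.out.println(rowId + " -> n=2: " + groupKey(rowId, 2)
                    + ", n=3: " + groupKey(rowId, 3));
        }
    }
}
```

In Pig the same expression could be written inside the FOREACH that builds B, in place of the conditional on f1 % 2.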

Output

A - input data as loaded

1,23,0.20
2,34,0.50
3,67,0.90
4,87,0.10
5,23,0.12

B - add a new second column: the row id, minus 1 for even-numbered rows.

1,1,23,0.20
2,1,34,0.50
3,3,67,0.90
4,3,87,0.10
5,5,23,0.12

C - group by the new second column

1,{(1,1,23,0.20),(2,1,34,0.50)}
3,{(3,3,67,0.90),(4,3,87,0.10)}
5,{(5,5,23,0.12)}

D - generate the sum after grouping

1,0.70
3,1.0
5,0.12

E - join B with the dataset from the previous step on the new column

1,1,23,0.20,1,0.70
2,1,34,0.50,1,0.70
3,3,67,0.90,3,1.0
4,3,87,0.10,3,1.0
5,5,23,0.12,5,0.12

F - select the required columns.

1,23,0.20,0.70
2,34,0.50,0.70
3,67,0.90,1.0
4,87,0.10,1.0
5,23,0.12,0.12
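To check the expected result outside Pig, the group, sum and join steps can be sketched in plain Java. This is only an illustration under stated assumptions: rows are `id:value:weight` strings with consecutive ids starting at 1, and the names `GroupSumJoin`, `groupKey` and `appendGroupTotals` are hypothetical. It is not the UDF from the question and not how Pig executes the script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical plain-Java mirror of relations B to F; not a Pig UDF.
class GroupSumJoin {
    // Same key as the f11 column when n = 2 (assumes ids 1, 2, 3, ...).
    static int groupKey(int rowId, int n) {
        return rowId - ((rowId - 1) % n);
    }

    static List<String> appendGroupTotals(List<String> rows, int n) {
        // Steps C and D: group by key and sum the last column.
        Map<Integer, Double> totals = new LinkedHashMap<>();
        for (String row : rows) {
            String[] fields = row.split(":");
            int key = groupKey(Integer.parseInt(fields[0]), n);
            totals.merge(key, Double.parseDouble(fields[2]), Double::sum);
        }
        // Steps E and F: look up each row's total and append it.
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            int key = groupKey(Integer.parseInt(row.split(":")[0]), n);
            result.add(row + ":" + String.format(Locale.ROOT, "%.2f", totals.get(key)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "1:23:0.20", "2:34:0.50", "3:67:0.90", "4:87:0.10", "5:23:0.12");
        appendGroupTotals(rows, 2).forEach(System.out::println);
    }
}
```

The totals are formatted to two decimals here, so a sum shown as 1.0 in the tables above is printed as 1.00 by this sketch.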

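Answer 1 (score: 0)

In your UDF you receive a tuple (int, chararray, chararray, int) and try to read its first element as a String. Because you have wrapped the code in try...catch, you never see the ClassCastException that is clearly being thrown. And since the data has already been loaded with PigStorage(':'), you do not need to split the value on ':'.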
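The effect of the swallowed exception can be reproduced without Pig. In the sketch below a plain `List<Object>` stands in for Pig's `Tuple` (an assumption made to keep the example self-contained), and the names `SwallowedCast`, `firstAsString` and `firstAsStringSafe` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: List<Object> is a stand-in for org.apache.pig.data.Tuple.
class SwallowedCast {
    // Mirrors the pattern in the question: cast first, catch everything.
    static String firstAsString(List<Object> fields) {
        try {
            // Throws ClassCastException when the first field is an Integer.
            return (String) fields.get(0);
        } catch (Exception e) {
            // The exception is hidden, so the caller only ever sees null.
            return null;
        }
    }

    // One possible alternative: convert the value instead of casting it.
    static String firstAsStringSafe(List<Object> fields) {
        Object first = fields.get(0);
        return first == null ? null : String.valueOf(first);
    }

    public static void main(String[] args) {
        // First field is an int, as in the schema (tag:int, ...).
        List<Object> row = Arrays.<Object>asList(1, "23", "0.20");
        System.out.println("cast and catch: " + firstAsString(row));
        System.out.println("converted:      " + firstAsStringSafe(row));
    }
}
```

Logging or rethrowing the exception in the catch block would surface the problem directly instead of producing empty output.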