Pig - Calculation

Time: 2014-10-09 06:35:58

Tags: apache-pig

I have a dataset in Pig that looks like this:

6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"

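The first line reads: "Patent 6009544 originated in New York and cites patent 6009545, which also originated in New York." For each state, I am trying to find the percentage of its patents that cite a patent originating in the same state. So my expected output should be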

NY: .5
OR: 1
AZ: .5
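This is because, of the 6 patents originating in New York, 3 cite a patent that also originated in New York. The 1 patent originating in Oregon cites a patent that also originated in Oregon. Of the 2 patents originating in Arizona, 1 cites a patent that also originated in Arizona.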

Can anyone suggest a good way to do this in Pig?

2 answers:

Answer 0 (score: 1)

Can you try this?

input.txt
6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"

PigScript:
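-- Load the raw input lines.
A = LOAD 'input.txt' AS line;
-- Split each line into citing patent, citing state, cited patent, cited state.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray);
-- Group the rows by the citing patent's state.
C = GROUP B BY f2;
D = FOREACH C {
        -- Rows whose cited patent comes from the same state.
        FilterByPatent = FILTER B BY f2==f4;
        -- All rows for this state.
        CityPatentCount = COUNT(B.f2);
        GENERATE group,((float)COUNT(FilterByPatent)/(float)CityPatentCount);
    }
DUMP D;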

Output:
(AZ,0.5)
(NY,0.5)
(OR,1.0)
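
As a possible alternative to the nested FOREACH, the same fraction can be expressed as the average of a 0/1 flag per row. This is only a minimal sketch and has not been run against the data above. It assumes the same input.txt, and the relation and field names (Lines, Parsed, Flagged, ByState, Result, state, same_state, fraction) are made up for illustration. TextLoader is used here so that each input line arrives as a single chararray regardless of the delimiter.

```pig
-- Read each line of the file as one chararray field.
Lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- Split each line into citing patent, citing state, cited patent, cited state.
Parsed = FOREACH Lines GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"'))
    AS (f1:int, f2:chararray, f3:int, f4:chararray);

-- Flag each row: 1.0 if the cited patent is from the same state, else 0.0.
Flagged = FOREACH Parsed GENERATE f2 AS state,
    ((f2 == f4) ? 1.0 : 0.0) AS same_state;

-- The mean of the flag within a state is the same-state fraction.
ByState = GROUP Flagged BY state;
Result = FOREACH ByState GENERATE group AS state, AVG(Flagged.same_state) AS fraction;

DUMP Result;
```

If the sketch behaves as intended, it should give the same three fractions as the script above; the flag-and-average form avoids counting twice, at the cost of one extra FOREACH.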

Answer 1 (score: 0)

I changed the sample data so that the fields are separated by single spaces.

Pig script:

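-- Load the space-separated fields: citing patent, its state, cited patent, its state.
A = load '/padata' using PigStorage(' ') as (pno:int,pcity:chararray,pci:int,pccity:chararray);
-- Group the rows by the citing patent's state.
b = group A by pcity;
r = foreach b {
        -- Total number of rows for this state.
        copcity = COUNT(A.pcity);
        -- Rows whose cited patent comes from the same state.
        samdata = FILTER A by pcity==pccity;
        csamdata = COUNT(samdata);
        percent = (float)csamdata/(float)copcity;
        generate group,percent;
    }
dump r;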