My requirement is as follows. Below are my input tuples.
Documents, PatternA, PatternB
['D1' , 'S1' , 'V1']
['D1' , 'S1' , 'V2']
['D1' , 'S2' , 'V1']
['D1' , 'S2' , 'V2']
['D2' , 'S1' , 'V1']
['D2' , 'S1' , 'V2']
['D2' , 'S1' , 'V3']
['D2' , 'S2' , 'V1']
['D2' , 'S2' , 'V2']
['D2' , 'S2' , 'V3']
['D2' , 'S3' , 'V1']
['D2' , 'S3' , 'V2']
['D2' , 'S3' , 'V3']
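For each (PatternA, PatternB) pair, I have to find out how many Documents it appears in. For example, based on the input tuples above, the pair S1, V1 appears in D1 and D2 (two documents). Example: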
['D1','S1','V1'] -input tuple line 1
['D2','S1','V1'] -input tuple line 5
The number of documents in which a (PatternA, PatternB) pair appears is called X (I need this X in the result output). Below is my code to achieve this.
Pipe pipe = new Pipe("All the above 13 records");
Fields groupFieldsX = new Fields("PatternA","PatternB"); //for calculating X
Pipe xPipe = new GroupBy(pipe, groupFieldsX, new Fields("PatternA","PatternB"));
xPipe = new Every(xPipe, Fields.ALL, new Count(new Fields("Xcount")), Fields.ALL);
xPipe = new Each(xPipe, new Debug());
Result X
['S1', 'V1', '2']
['S1', 'V2', '2']
['S1', 'V3', '1']
['S2', 'V1', '2']
['S2', 'V2', '2']
['S2', 'V3', '1']
['S3', 'V1', '1']
['S3', 'V2', '1']
['S3', 'V3', '1']
Number of tuples: 9
Up to here everything works for me.
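As a cross-check on Result X, here is a minimal plain-Java sketch (not Cascading) that counts the distinct Documents per (PatternA, PatternB) pair. The class name `XCountSketch` and the helpers `countX` and `sampleRows` are hypothetical names used only for illustration. Unlike a plain tuple count per group, it would still give the document count if the same (Documents, PatternA, PatternB) tuple appeared more than once in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class XCountSketch {

    // The 13 input tuples from the question: {Documents, PatternA, PatternB}.
    public static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<>();
        for (String s : new String[] {"S1", "S2"})
            for (String v : new String[] {"V1", "V2"})
                rows.add(new String[] {"D1", s, v});
        for (String s : new String[] {"S1", "S2", "S3"})
            for (String v : new String[] {"V1", "V2", "V3"})
                rows.add(new String[] {"D2", s, v});
        return rows;
    }

    // X: for each "PatternA,PatternB" key, the number of distinct Documents.
    public static Map<String, Integer> countX(List<String[]> rows) {
        Map<String, Set<String>> docsByPair = new TreeMap<>();
        for (String[] r : rows) {
            docsByPair.computeIfAbsent(r[1] + "," + r[2], k -> new TreeSet<>()).add(r[0]);
        }
        Map<String, Integer> x = new TreeMap<>();
        docsByPair.forEach((pair, docs) -> x.put(pair, docs.size()));
        return x;
    }

    public static void main(String[] args) {
        countX(sampleRows()).forEach((pair, n) -> System.out.println(pair + " -> " + n));
    }
}
```

On the 13 sample tuples this yields 9 pairs, with S1,V1 counted in 2 documents, in line with Result X.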
Now I need to find Y. Y is simply the number of Documents in which a PatternA occurs (only unique values should be counted). For example:
D1, S1
D2, S1
D1, S1
(Here S1 occurs in only 2 documents, because the third tuple is a duplicate. So the Y count is 2.)
Below is my code for computing Y.
pipe = new Retain(pipe, new Fields("Documents","PatternA"));// from the 13 tuples I take only Documents,PatternA
pipe = new Unique(pipe, new Fields("Documents","PatternA"));// eliminate duplicates
Fields groupFieldsY = new Fields("Documents","PatternA"); //for calculating Y
Pipe yPipe = new GroupBy(pipe, groupFieldsY);
yPipe = new Every(yPipe,Fields.ALL, new Count(new Fields("Ycount")), Fields.ALL);
yPipe = new Each(yPipe, new Debug());
Result Y
['D1', 'S1', '1']
['D1', 'S2', '1']
['D2', 'S1', '1']
['D2', 'S2', '1']
['D2', 'S3', '1']
Number of tuples: 5
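For comparison, here is a minimal plain-Java sketch (not Cascading) of Y as the example describes it, under the assumption that Y is keyed by PatternA alone: the distinct Documents are collected per PatternA and then counted. The names `YCountSketch`, `countY` and `sampleRows` are hypothetical. With this reading S1 gets a Y of 2, whereas Result Y above has one row per (Documents, PatternA) combination, each with a count of 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class YCountSketch {

    // The 13 input tuples from the question: {Documents, PatternA, PatternB}.
    public static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<>();
        for (String s : new String[] {"S1", "S2"})
            for (String v : new String[] {"V1", "V2"})
                rows.add(new String[] {"D1", s, v});
        for (String s : new String[] {"S1", "S2", "S3"})
            for (String v : new String[] {"V1", "V2", "V3"})
                rows.add(new String[] {"D2", s, v});
        return rows;
    }

    // Y: for each PatternA, the number of distinct Documents it occurs in.
    // The Set removes duplicate (Documents, PatternA) combinations.
    public static Map<String, Integer> countY(List<String[]> rows) {
        Map<String, Set<String>> docsByPatternA = new TreeMap<>();
        for (String[] r : rows) {
            docsByPatternA.computeIfAbsent(r[1], k -> new TreeSet<>()).add(r[0]);
        }
        Map<String, Integer> y = new TreeMap<>();
        docsByPatternA.forEach((patternA, docs) -> y.put(patternA, docs.size()));
        return y;
    }

    public static void main(String[] args) {
        countY(sampleRows()).forEach((a, n) -> System.out.println(a + " -> " + n));
    }
}
```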
Now, here is the final output I need, combining Result X and Result Y.
Report:
PatternA, PatternB, Xcount, Documents, Ycount
S1 , V1 , 2 , D1,D2 , 2
S1 , V2 , 2 , D1,D2 , 2
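... the report continues
Can anyone suggest how to produce this report from the 9 + 5 output tuples? If there is any other efficient way to compute the X and Y values and generate the report, please suggest that as well.
Thanks in advance for your comments/suggestions/solutions.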
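To make the target report concrete, here is a minimal plain-Java sketch (not a Cascading flow) that builds the rows in one pass over the input. It assumes Y is the number of distinct Documents per PatternA, as in the S1 example, and the names `ReportSketch`, `buildReport` and `sampleRows` are hypothetical. In Cascading terms this probably corresponds to joining the per-pair counts with a per-PatternA count on the PatternA field (for example with a CoGroup), but that mapping is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ReportSketch {

    // The 13 input tuples from the question: {Documents, PatternA, PatternB}.
    public static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<>();
        for (String s : new String[] {"S1", "S2"})
            for (String v : new String[] {"V1", "V2"})
                rows.add(new String[] {"D1", s, v});
        for (String s : new String[] {"S1", "S2", "S3"})
            for (String v : new String[] {"V1", "V2", "V3"})
                rows.add(new String[] {"D2", s, v});
        return rows;
    }

    // Builds one line per (PatternA, PatternB) pair:
    // PatternA, PatternB, Xcount, Documents, Ycount
    public static List<String> buildReport(List<String[]> rows) {
        // PatternA -> PatternB -> distinct Documents (gives X and the Documents list)
        Map<String, Map<String, Set<String>>> docsByPair = new TreeMap<>();
        // PatternA -> distinct Documents (gives Y)
        Map<String, Set<String>> docsByPatternA = new TreeMap<>();
        for (String[] r : rows) {
            docsByPair.computeIfAbsent(r[1], k -> new TreeMap<>())
                      .computeIfAbsent(r[2], k -> new TreeSet<>())
                      .add(r[0]);
            docsByPatternA.computeIfAbsent(r[1], k -> new TreeSet<>()).add(r[0]);
        }

        List<String> report = new ArrayList<>();
        docsByPair.forEach((patternA, byPatternB) -> {
            int y = docsByPatternA.get(patternA).size();
            byPatternB.forEach((patternB, docs) -> report.add(String.format(
                    "%s, %s, %d, %s, %d",
                    patternA, patternB, docs.size(), String.join(",", docs), y)));
        });
        return report;
    }

    public static void main(String[] args) {
        System.out.println("PatternA, PatternB, Xcount, Documents, Ycount");
        buildReport(sampleRows()).forEach(System.out::println);
    }
}
```

Sorted maps and sets are used only so that the output order is stable. This in-memory approach suits a small sample like the one above; for large inputs the same two groupings would have to be expressed as pipes.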