My dataset is in the following format:
100000853384|RETAIL|OTHER|4.625|280000|360|02/2012|04/2012|31|31|1|23|801|NO|CASH-OUT REFINANCE|SF|1|INVESTOR|CA|945||FRM
100003735682|RETAIL|SUNTRUST MORTGAGE INC.|3.99|466000|360|01/2012|03/2012|80|80|2|30|788|NO|PURCHASE|SF|1|PRINCIPAL|MD|208||FRM
100006367485|CORRESPONDENT|PHH MORTGAGE CORPORATION|4|229000|360|02/2012|04/2012|67|67|2|36|794|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|CA|959||FRM
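The 4th field is ORIGINAL_INTEREST_RATE. My question is:
at which interest rate did the largest number of people take out a loan?
I wrote the code below.
Load the dataset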
loanAqiData = LOAD 'hdfs://masterNode:8020/home/hadoop/hadoop_data/LOAN_Acquisition_DATA/Acquisition_2012Q1.txt'
USING PigStorage('|')
AS
(
LOAN_IDENTIFIER:chararray
, CHANNEL:chararray
, SELLER_NAME:chararray
, ORIGINAL_INTEREST_RATE:float
, ORIGINAL_UNPAID_PRINCIPAL_BALANCE :float
, ORIGINAL_LOAN_TERM :float
, ORIGINATION_DATE:chararray
, FIRST_PAYMENT_DATE:chararray
, ORIGINAL_LOAN_TO_VALUE:float
, ORIGINAL_COMBINED_LOAN_TO_VALUE :float
, NUMBER_OF_BORROWERS:float
, DEBT_TO_INCOME_RATIO:float
, CREDIT_SCORE:float
, FIRST_TIME_HOME_BUYER_INDICATOR:chararray
, LOAN_PURPOSE:chararray
, PROPERTY_TYPE:chararray
, NUMBER_OF_UNITS:chararray
, OCCUPANCY_STATUS:chararray
, PROPERTY_STATE:chararray
, ZIP:chararray
, MORTGAGE_INSURANCE_PERCENTAGE:float
, PRODUCT_TYPE:chararray
);
Group by interest rate
grouped_by_interest_rate = group loanAqiData by ORIGINAL_INTEREST_RATE;
Count the number of people for each interest rate
count_for_specific_interest = FOREACH grouped_by_interest_rate GENERATE group as INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
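Dump
dump count_for_specific_interest;
Output
(3.625,1) (3.75,2) (3.875,26) (3.99,8) (4.0,21) (4.1,1) (4.125,15) (4.25,16) (4.375,15) (4.376,26) (4.5,10) (4.625,3)
But I want to get only (3.875,26) and (4.376,26), the rows with the highest count.
How can I get that?
Also, what if I want the interest rate at which the fewest people took out a loan?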
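Answer 0 (score: 0)
I suggest using the MAX() function (http://pig.apache.org/docs/r0.11.0/func.html#max) to determine the maximum number of people, and then filtering by that number.
Here is a code example that should work (untested):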
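-- MAX() operates on a bag, so first collect every (rate, count) tuple into a single bag
all_counts = GROUP count_for_specific_interest ALL;
max_count = FOREACH all_counts GENERATE MAX(count_for_specific_interest.NO_OF_PEOPLE) AS max_value;
-- max_count has exactly one row, so max_count.max_value can be used as a scalar in the filter
RESULT = FILTER count_for_specific_interest BY NO_OF_PEOPLE == max_count.max_value;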
For the minimum, you can use exactly the same script with MIN() in place of MAX().
Answer 1 (score: 0)
This is finally solved. Let me write down the steps.
1) Load
2) Group by interest rate
grp = group loanAqiData by ORIGINAL_INTEREST_RATE;
3) Count the number of people for each interest rate
cntForEachGrp = FOREACH grp GENERATE group as
INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
Output
(3.625,1)(3.75,2)(3.875,26)(3.99,8)(4.0,21)(4.1,1)(4.125,15)(4.25,16)(4.375,15)(4.376,26)(4.5,10)(4.625,3)
4) Group everything together to put all the tuples into the same bag
grpALL = GROUP cntForEachGrp ALL;
(all,{(3.625,1),(3.75,2),(3.875,26),(3.99,8),(4.0,21),(4.1,1),(4.125,15),(4.25,16),(4.375,15),(4.376,1),(4.5,10),(4.625,3),(4.75,5),(4.875,4),(5.0,2),(5.25,1)})
5) Compute the maximum number of people in the bag
maxVal = FOREACH grpALL {
max_value= MAX(cntForEachGrp.NO_OF_PEOPLE);
GENERATE cntForEachGrp.INTEREST_RATE, cntForEachGrp.NO_OF_PEOPLE, max_value as
max_no;
}
grunt> describe maxVal;
maxVal: {{(INTEREST_RATE: float)},{(NO_OF_PEOPLE: long)},max_no: long}
dump maxVal;
({(3.625),(3.75),(3.875),(3.99),(4.0),(4.1),(4.125),(4.25),(4.375),(4.376),(4.5),(4.625),(4.75),(4.875),(5.0),(5.25)},{(1),(2),(26),(8),(21),(1),(15),(16),(15),(1),(10),(3),(5),(4),(2),(1)},26)
6) Filter for the interest rate with the maximum number of people
RESULT=FILTER cntForEachGrp BY NO_OF_PEOPLE == maxVal.max_no ;
After dumping RESULT, we get interest rate 3.875, which has the maximum number of people, 26.
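The question also asked about the interest rate with the fewest borrowers. A minimal sketch of one way to get both extremes in a single pass, assuming cntForEachGrp from step 3 is already defined with the schema (INTEREST_RATE: float, NO_OF_PEOPLE: long). The aliases extremes, most_common and least_common are names made up for this illustration, and the script has not been run against the original data:

```pig
-- Continues from step 3; cntForEachGrp is (INTEREST_RATE: float, NO_OF_PEOPLE: long).
grpALL = GROUP cntForEachGrp ALL;

-- One row holding the largest and the smallest count.
-- Only the two numbers are kept, not the full bags as in step 5.
extremes = FOREACH grpALL GENERATE
    MAX(cntForEachGrp.NO_OF_PEOPLE) AS max_no,
    MIN(cntForEachGrp.NO_OF_PEOPLE) AS min_no;

-- extremes has a single row, so its fields can be used as scalars,
-- the same way step 6 uses maxVal.max_no.
most_common  = FILTER cntForEachGrp BY NO_OF_PEOPLE == extremes.max_no;
least_common = FILTER cntForEachGrp BY NO_OF_PEOPLE == extremes.min_no;

DUMP most_common;
DUMP least_common;
```

Because the filter compares against the extreme value instead of taking the first row of a sorted relation, rates that tie for the maximum or minimum are all kept.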
Why do we need
grpALL = GROUP cntForEachGrp ALL;
and what does the nested FOREACH in step (5) actually mean?
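One possible reading, offered as a sketch and not as an authoritative answer: MAX() takes a bag of values, but each row of cntForEachGrp is a flat tuple, so there is no bag to aggregate over. GROUP ... ALL produces a single row whose second field is a bag containing every tuple, as the output of step 4 shows. The braces of the nested FOREACH then let you define intermediate aliases that are evaluated once per input row, against that row's bag. The alias counts below is a name invented for this illustration:

```pig
-- After GROUP ALL there is exactly one row: ('all', {bag of all (rate, count) tuples}).
grpALL = GROUP cntForEachGrp ALL;
DESCRIBE grpALL;

-- Nested FOREACH: the statements inside the braces run once per row of grpALL.
-- Since grpALL has a single row, they run once, over the whole bag.
maxVal = FOREACH grpALL {
    counts = cntForEachGrp.NO_OF_PEOPLE;  -- project one column of the bag
    GENERATE MAX(counts) AS max_no;       -- aggregate that column
}
DUMP maxVal;
```

In step 5 the GENERATE also emits the two projected bags, which is why the dump there shows two long bags before the 26. They are not needed for the filter in step 6, which only reads max_no.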