Hadoop Pig - Replacing strings in a relation with the corresponding values from a map

Asked: 2013-07-17 00:28:59

Tags: apache-pig

I have a relation called conversations_grouped. Each row holds a bag of single-field tuples, and the bags vary in size, like this:

DUMP conversations_grouped;
...
({(L194),(L195),(L196),(L197)})
({(L198),(L199)})
({(L200),(L201),(L202),(L203)})
({(L204),(L205),(L206)})
({(L207),(L208)})
({(L271),(L272),(L273),(L274),(L275)})
({(L276),(L277)})
({(L280),(L281)})
({(L363),(L364)})
({(L365),(L366)})
({(L666256),(L666257)})
({(L666369),(L666370),(L666371),(L666372)})
({(L666520),(L666521),(L666522)})

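Each L[0-9]+ is a tag that corresponds to a string. For example, L194 might be "Hello, how are you?" and L195 might be "Fine, how are you?". This correspondence is maintained by a map called line_map. Here is a sample: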

DUMP line_map;
...
([L666324#Do you think she might be interested in  someone?])
([L666264#Well that's typical of Her Majesty's army. Appoint an engineer to do a soldier's work.])
([L666263#Um. There are rumours that my Lord Chelmsford intends to make Durnford Second in Command.])
([L666262#Lighting COGHILL' 5 cigar: Our good Colonel Dumford scored quite a coup with the Sikali Horse.])
([L666522#So far only their scouts. But we have had reports of a small Impi farther north, over there. ])
([L666521#And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?])
([L666520#Well I assure you, Sir, I have no desire to create difficulties. 45])
([L666372#I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.])
([L666371#Lord Chelmsford seems to want me to stay back with my Basutos.])
([L666370#I'm to take the Sikali with the main column to the river])
([L666369#Your orders, Mr Vereker?])
([L666257#Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot])
([L666256#Colonel Durnford... William Vereker. I hear you 've been seeking Officers?])

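What I want to do now is go through each row and replace the L[0-9]+ tags with the corresponding text from line_map. Is it possible to reference line_map from inside a Pig FOREACH statement, or do I need to do something else?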

1 Answer:

Answer 0 (score: 1)

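The first problem is that the key in a map lookup must be a quoted string constant, so you cannot use the value of a field from the schema to look something up in the map. For example, given the schema below, the FOREACH that follows it does not work: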

C: {foo: chararray, M: [value:chararray]}
D = FOREACH C GENERATE M#foo ;
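For contrast, here is a minimal sketch of a lookup that Pig does accept, because the key is a quoted constant. The input path 'c_input' and the alias message are made up for illustration, and the map schema is simplified to plain chararray values.

```pig
-- Hypothetical input: one chararray field and one map per row.
C = LOAD 'c_input' AS (foo: chararray, M: map[chararray]);

-- Accepted: the key is a quoted string constant, not a field reference.
D = FOREACH C GENERATE M#'L194' AS message;
```

This only helps when the key is known when the script is written, which is not the case here, since the tags come from the data.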

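The solution that comes to mind is to FLATTEN conversations_grouped and then JOIN the flattened relation with line_map on the L[0-9]+ tag. You will probably want to project out the fields you no longer need (such as the L[0-9]+ tag after the join) so that the next step runs faster. After that you have to regroup the data and massage it into the right format.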

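This will not work unless each bag has its own ID to regroup on. However, if each L[0-9]+ tag appears in only one bag (conversation), you can use a tag from the bag as a unique ID: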

-- A is conversations_grouped, with schema A: {tags: {(tag: chararray)}}

B = FOREACH A {
    -- Pulls out one element from the bag to use as the id
    id = LIMIT tags 1 ;
    -- Flattens A into (id, tag) form.  Each group of tags will have the same id.
    GENERATE FLATTEN(id), FLATTEN(tags) ;
    }

The schema and output of B are:

B: {id: chararray,tags::tag: chararray}
(L194,L194)
(L194,L195)
(L194,L196)
(L194,L197)
(L198,L198)
(L198,L199)
(L200,L200)
(L200,L201)
(L200,L202)
(L200,L203)
(L204,L204)
(L204,L205)
(L204,L206)
(L207,L207)
(L207,L208)
(L271,L271)
(L271,L272)
(L271,L273)
(L271,L274)
(L271,L275)
(L276,L276)
(L276,L277)
(L280,L280)
(L280,L281)
(L363,L363)
(L363,L364)
(L365,L365)
(L365,L366)
(L666256,L666256)
(L666256,L666257)
(L666369,L666369)
(L666369,L666370)
(L666369,L666371)
(L666369,L666372)
(L666520,L666520)
(L666520,L666521)
(L666520,L666522)

Assuming the tags are unique, the rest is as simple as:

-- A2 is line_map, loaded in tag/message pairs instead of a map

-- Joins conversations_grouped and line_map on tag
C = FOREACH (JOIN B by tags::tag, A2 by tag)
    -- This generate removes the tag
    GENERATE id, message ;

-- Regroups C on the id created in B
D = FOREACH (GROUP C BY id) 
    -- This step limits the output to just messages
    GENERATE C.(message) AS messages ;

The schema and output of D:

D: {messages: {(A2::message: chararray)}}
({(Colonel Durnford... William Vereker. I hear you 've been seeking Officers?),(Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot)})
({(Your orders, Mr Vereker?),(I'm to take the Sikali with the main column to the river),(Lord Chelmsford seems to want me to stay back with my Basutos.),(I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.)})
({(Well I assure you, Sir, I have no desire to create difficulties. 45),(And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?),(So far only their scouts. But we have had reports of a small Impi farther north, over there. )})
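The join step assumes that line_map is available as flat tag/message pairs (the relation called A2) rather than as a map. One way to get that shape, sketched under the assumption that the lines are stored as tab-separated text, is to load them directly with a two-field schema. The path 'line_map.tsv' is hypothetical.

```pig
-- 'line_map.tsv' is a hypothetical path; each row is assumed to be "<tag>\t<message>".
A2 = LOAD 'line_map.tsv' USING PigStorage('\t')
     AS (tag: chararray, message: chararray);
```

Loading the pairs this way avoids the map type entirely, so the tag can be used as an ordinary join key.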

Note: in the worst case (the L[0-9]+ tags are not unique), you can give each line of your input file a sequential integer id before loading it into Pig.

Update: if you are using Pig 0.11, you can also use the RANK operator.
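A rough sketch of the RANK variant follows. It assumes Pig 0.11 or later, a hypothetical input path, and that each row stores one bag of tags in PigStorage's bag syntax. The prepended rank is referenced by position ($0) so the sketch does not depend on the generated field name.

```pig
-- Hypothetical path; assumes one bag of tags per row.
A = LOAD 'conversations_grouped' AS (tags: {t: (tag: chararray)});

-- With no sort fields, RANK prepends a sequential number to every tuple.
R = RANK A;

-- $0 is the prepended rank; use it as the conversation id.
B = FOREACH R GENERATE $0 AS id, FLATTEN(tags) AS tag;
```

Here id is a number rather than a tag, so it stays unique even when the same tag shows up in more than one conversation. The flattened field is named tag in this sketch, so the later join would use B BY tag instead of tags::tag.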