I am trying to understand how the following code sample extracts the first Twitter handle mentioned in a tweet:
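-- Load the tweets; with no USING clause, Pig splits fields on tabs.
a = load '/user/pig/full_text.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
-- Lowercase the tweet text so that '@USER_' becomes '@user_'.
b = foreach a generate id, ts, location, LOWER(tweet) as tweet;
-- Return capture group 2 of the pattern: the 8 non-whitespace characters after '@user_'.
c = foreach b generate id, ts, location, REGEX_EXTRACT(tweet, '(.*)@user_(\\S{8})([:| ])(.*)',2) as tweet;
-- Print 5 rows of the result.
d = limit c 5;
dump d;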
The data in the file full_text.txt has the following format:
USER_79321756 2010-03-03T04:15:26 ÜT: 47.528139,-122.197916 47.528139 -122.197916 RT @USER_2ff4faca: IF SHE DO IT 1 MORE TIME......IMA KNOCK HER DAMN KOOFIE OFF.....ON MY MOMMA>>haha. #cutthatout
USER_79321756 2010-03-03T04:55:32 ÜT: 47.528139,-122.197916 47.528139 -122.197916 @USER_77a4822d @USER_2ff4faca okay:) lol. Saying ok to both of yall about to different things!:*
USER_79321756 2010-03-03T05:13:34 ÜT: 47.528139,-122.197916 47.528139 -122.197916 RT @USER_5d4d777a: YOURE A FOR GETTING IN THE MIDDLE OF THIS @USER_ab059bdc WHO THE FUCK ARE YOU ? A FUCKING NOBODY !!!!>>Lol! Dayum! Aye!
USER_79321756 2010-03-03T05:28:02 ÜT: 47.528139,-122.197916 47.528139 -122.197916 @USER_77a4822d yea ok..well answer that cheap as Sweden phone you came up on when I call.
USER_79321756 2010-03-03T05:56:13 ÜT: 47.528139,-122.197916 47.528139 -122.197916 A sprite can disappear in her mouth - lil kim hmmmmm the can not the bottle right?
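However, I cannot understand how the call REGEX_EXTRACT(tweet, '(.*)@user_(\\S{8})([:| ])(.*)',2) works. Could someone explain in simple terms what the regular expression is searching for in this case, and how the index selects the first Twitter handle?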
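To make the question concrete, here is a minimal sketch that reproduces the call outside Pig. It rests on some assumptions: Python's `re` module stands in for the Java regex engine that Pig uses (for a pattern this simple the two should behave the same), and `regex_extract` is a hypothetical helper that only mimics REGEX_EXTRACT by returning one capture group or nothing. The sample strings are lowercased tweets from the data above.

```python
import re

# Same pattern as in the Pig script. Pig's '\\S' is the regex \S;
# a Python raw string needs only one backslash.
GREEDY = re.compile(r'(.*)@user_(\S{8})([:| ])(.*)')
# Variant for comparison: the leading group is lazy (.*?) instead of greedy.
LAZY = re.compile(r'(.*?)@user_(\S{8})([:| ])(.*)')


def regex_extract(pattern, text, index):
    """Hypothetical stand-in for REGEX_EXTRACT: group `index`, or None if no match."""
    match = pattern.search(text)
    return match.group(index) if match else None


# Lowercased tweets, as produced by relation b.
one_mention = "rt @user_2ff4faca: if she do it 1 more time"
two_mentions = "@user_77a4822d @user_2ff4faca okay:) lol."
no_mention = "a sprite can disappear in her mouth"

# Group 1 is everything before '@user_', group 2 the 8 characters after it,
# group 3 the single delimiter (':', '|' or a space), group 4 the rest.
print(GREEDY.search(one_mention).groups())
# ('rt ', '2ff4faca', ':', ' if she do it 1 more time')

# With two mentions, the greedy (.*) consumes as much as it can,
# so group 2 lands on the LAST handle that fits the pattern.
print(regex_extract(GREEDY, two_mentions, 2))  # 2ff4faca

# The lazy variant stops at the first '@user_' instead.
print(regex_extract(LAZY, two_mentions, 2))    # 77a4822d

# No '@user_' at all: nothing is extracted.
print(regex_extract(GREEDY, no_mention, 2))    # None
```

In this sketch the index 2 only selects which parenthesized group is returned. Which mention ends up in that group depends on how much the leading `(.*)` consumes, so for tweets with several mentions the original pattern appears to return the last handle rather than the first.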