I have two files (messages and keys). I want to filter for every message that contains a word from keys.
messages = LOAD 'my-messages.txt' as (message:chararray);
keys = LOAD 'keys.txt' as (key: chararray);
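I know I could do an inner join between messages and keys, but that would not work in a case like this, where the key is only part of the message: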
message = "hi there"
key = "hi"
I think a UDF is one way around it:
DEFINE containsKey my.udf.Matches('path/keys.txt');
matches = FILTER messages BY containsKey(message);
and then loop through all the keys inside the UDF (yikes!). That feels wrong... I'm not sure my approach is right, so feel free to make suggestions.
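The UDF idea above can be sketched roughly as follows. This is a plain-Python illustration only, not a Pig UDF (which would normally be a Java class); `KeyMatcher` is a hypothetical stand-in for `my.udf.Matches`, and the case-sensitive substring check is an assumption, since the question does not say how a match should be defined.

```python
class KeyMatcher:
    """Hypothetical stand-in for the my.udf.Matches UDF (illustration only)."""

    def __init__(self, keys):
        # Load the keys once, up front, and ignore blank entries.
        self.keys = [k.strip() for k in keys if k.strip()]

    def __call__(self, message):
        # Assumption: a message matches if any key occurs in it as a substring.
        return any(key in message for key in self.keys)


contains_key = KeyMatcher(["hi"])
messages = ["hi there", "Hello there"]

# Equivalent in spirit to: FILTER messages BY containsKey(message)
matches = [m for m in messages if contains_key(m)]
print(matches)
```

Each message is still checked against every key, which is the loop the question is uneasy about; the sketch only moves the key loading out of the per-message path.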
Answer 0 (score: 2)
This looks like a use case for CROSS. Reference: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#CROSS
This may not be the optimal solution; I'm just sharing an approach that works.
Input:
Messages:
hi there
He said "Hi, how are you doing ?"
HI there
Hello there
Keys:
hi
Pig script:
messages = LOAD 'messages.csv' USING PigStorage('\t') AS (message:chararray);
keys = LOAD 'keys.csv' USING PigStorage('\t') AS (key:chararray);
crossed_data = CROSS messages, keys ;
filt_required_data = FILTER crossed_data BY LOWER(messages::message) MATCHES CONCAT('.*', LOWER(keys::key), '.*');
required_data = FOREACH filt_required_data GENERATE messages::message AS message;
DUMP required_data;
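To make the CROSS-then-FILTER logic concrete, here is a rough plain-Python model of it, using the sample messages and the key `hi`. It is not Pig code, and the function name `cross_filter` is made up for illustration. Like Pig's MATCHES, `re.fullmatch` must match the whole string, which is why the key is wrapped in `.*` on both sides.

```python
import re
from itertools import product


def cross_filter(messages, keys):
    """Illustrative model of CROSS followed by FILTER ... MATCHES."""
    kept = []
    # CROSS: pair every message with every key.
    for message, key in product(messages, keys):
        # FILTER: case-insensitive match of '.*key.*' against the whole message.
        pattern = ".*" + key.lower() + ".*"
        if re.fullmatch(pattern, message.lower()):
            kept.append(message)
    return kept


messages = [
    "hi there",
    'He said "Hi, how are you doing ?"',
    "HI there",
    "Hello there",
]
keys = ["hi"]

print(cross_filter(messages, keys))
```

For this sample the model keeps the first three messages and drops "Hello there". Note that a message matching several keys is emitted once per matching key, so with a larger key file a DISTINCT on the result may be wanted.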