Below is what I have tried in Pig. My Pig script:
-- set the debug mode
SET debug 'off'
-- Registering the python udf
REGISTER /home/hema/phd/work1/coding/myudf.py USING streaming_python as myudf
RAWDATA = LOAD '/home/hema/temp' USING TextLoader() AS content;
LOWERCASE_DATA = FOREACH RAWDATA GENERATE LOWER(content) AS con;
TOKENIZED_DATA = FOREACH LOWERCASE_DATA GENERATE myudf.special_tokenize(con) AS conn;
DUMP TOKENIZED_DATA;
My Python UDF:
from pig_util import outputSchema
import nltk

@outputSchema('word:chararray')
def special_tokenize(input):
    # tokenize the incoming line with NLTK's default tokenizer
    tokens = nltk.word_tokenize(input)
    return tokens
The code runs fine, but the output is messy. How can I remove the unwanted underscores and vertical bars? The output looks like this:
(|{_|(_additionalcontext|)_|,_|(_in|)_|,_|(_namefinder|)_|}_)
(|{_|(_is|)_|,_|(_there|)_|,_|(_any|)_|,_|(_possibility|)_|,_|(_to|)_|,_|(_use|)_|,_|(_additionalcontext|)_|,_|(_with|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_?|)_|,_|(_if|)_|,_|(_so|)_|,_|(_,|)_|,_|(_how|)_|,_|(_?|)_|,_|(_if|)_|,_|(_there|)_|,_|(_is|)_|,_|(_n't|)_|,_|(_maybe|)_|,_|(_this|)_|,_|(_should|)_|,_|(_be|)_|,_|(_an|)_|,_|(_issue|)_|,_|(_to|)_|,_|(_be|)_|,_|(_added|)_|,_|(_in|)_|,_|(_the|)_|,_|(_future|)_|,_|(_releases|)_|,_|(_?|)_|}_)
(|{_|(_i|)_|,_|(_would|)_|,_|(_really|)_|,_|(_greatly|)_|,_|(_appreciate|)_|,_|(_if|)_|,_|(_someone|)_|,_|(_can|)_|,_|(_help|)_|,_|(_(|)_|,_|(_give|)_|,_|(_me|)_|,_|(_some|)_|,_|(_sample|)_|,_|(_code/show|)_|,_|(_me|)_|,_|(_)|)_|,_|(_how|)_|,_|(_to|)_|,_|(_add|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_features|)_|,_|(_while|)_|,_|(_training|)_|,_|(_and|)_|,_|(_testing|)_|,_|(_namefinder|)_|,_|(_.|)_|}_)
(|{_|(_if|)_|,_|(_the|)_|,_|(_incoming|)_|,_|(_data|)_|,_|(_is|)_|,_|(_just|)_|,_|(_tokens|)_|,_|(_with|)_|,_|(_no|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_information|)_|,_|(_,|)_|,_|(_where|)_|,_|(_is|)_|,_|(_the|)_|,_|(_information|)_|,_|(_taken|)_|,_|(_then|)_|,_|(_?|)_|,_|(_a|)_|,_|(_new|)_|,_|(_file|)_|,_|(_?|)_|,_|(_run|)_|,_|(_a|)_|,_|(_pos|)_|,_|(_tagging|)_|,_|(_model|)_|,_|(_before|)_|,_|(_training|)_|,_|(_?|)_|,_|(_or|)_|,_|(_?|)_|}_)
(|{_|(_and|)_|,_|(_what|)_|,_|(_is|)_|,_|(_the|)_|,_|(_purpose|)_|,_|(_of|)_|,_|(_the|)_|,_|(_resources|)_|,_|(_(|)_|,_|(_i.e|)_|,_|(_.|)_|,_|(_collection.|)_|,_|(_<|)_|,_|(_string|)_|,_|(_,|)_|,_|(_object|)_|,_|(_>|)_|,_|(_emptymap|)_|,_|(_(|)_|,_|(_)|)_|,_|(_)|)_|,_|(_in|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_method|)_|,_|(_?|)_|,_|(_what|)_|,_|(_should|)_|,_|(_be|)_|,_|(_ideally|)_|,_|(_included|)_|,_|(_in|)_|,_|(_there|)_|,_|(_?|)_|}_)
(|{_|(_i|)_|,_|(_just|)_|,_|(_ca|)_|,_|(_n't|)_|,_|(_get|)_|,_|(_these|)_|,_|(_things|)_|,_|(_from|)_|,_|(_the|)_|,_|(_java|)_|,_|(_doc|)_|,_|(_api|)_|,_|(_.|)_|}_)
(|{_|(_in|)_|,_|(_advance|)_|,_|(_!|)_|}_)
(|{_|(_best|)_|,_|(_,|)_|}_)
(|{_|(_svetoslav|)_|}_)
The raw data:
AdditionalContext in NameFinder
Is there any possibility to use additionalContext with the NameFinderME.train? If so, how? If there isn't maybe this should be an issue to be added in the future releases?
I would REALLY greatly appreciate if someone can help (give me some sample code/show me) how to add POS tag features while training and testing NameFinder.
If the incoming data is just tokens with NO POS tag information, where is the information taken then? A new file? Run a POS tagging model before training? Or?
And what is the purpose of the resources (i.e. Collection.<String,Object>emptyMap()) in the NameFinderME.train method? What should be ideally included in there?
I just can't get these things from the Java doc API.
in advance!
Best,
Svetoslav
I would like a list of tokens as my final output. Thank you.
Answer 0 (score: 0)
Use REPLACE to strip the '_' and '|' characters, then use TOKENIZE to get the tokens.
-- note: REPLACE takes a regular expression, so the '|' must be escaped
NEW_TOKENIZED_DATA = FOREACH TOKENIZED_DATA GENERATE REPLACE(REPLACE($0,'_',''),'\\|','');
TOKENS = FOREACH NEW_TOKENIZED_DATA GENERATE TOKENIZE($0);
DUMP TOKENS;
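If you want one token per row instead of a bag per line, flattening the TOKENIZE output should get you there. A minimal sketch reusing the relation names above (untested against this exact data; the stray '|' and '_' appear to be serialization artifacts of returning a Python list where a chararray was declared):

-- FLATTEN unnests the bag produced by TOKENIZE into one token per row
TOKEN_LIST = FOREACH NEW_TOKENIZED_DATA GENERATE FLATTEN(TOKENIZE($0)) AS token;
DUMP TOKEN_LIST;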
Answer 1 (score: 0)
from pig_util import outputSchema
import nltk
import re

@outputSchema('word:chararray')
def special_tokenize(input):
    # split camel-case words before tokenizing
    temp_data = re.sub(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', " ", input)
    tokens = nltk.word_tokenize(temp_data.encode('utf-8'))
    final_token = ','.join(tokens)
    return final_token
There was an issue with the encoding of the input; changing it to UTF-8 solved the problem. Joining the tokens into a single comma-separated string also makes the returned value match the declared chararray schema.
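If you would rather get a real list (bag) of tokens back instead of a comma-joined string, declaring a bag output schema may be cleaner, since streaming_python serializes a Python list of tuples as a Pig bag. A minimal sketch (the schema string and function name here are assumptions, not from the original answer):

from pig_util import outputSchema
import nltk

@outputSchema('tokens:{(word:chararray)}')
def special_tokenize_bag(line):
    # each token becomes a 1-tuple so the list maps onto a Pig bag of tuples
    return [(t,) for t in nltk.word_tokenize(line)]

In the Pig script you would then FLATTEN the returned bag to get one token per row.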