How do I reverse tokenization after running the tokens through the name finder?

Asked: 2018-04-09 16:15:27

Tags: opennlp

After using NameFinderME to find names in a sequence of tokens, I want to reverse the tokenization and reconstruct the original text with the names modified. Is there a way to reverse the tokenization exactly as it was performed, so that the output has the exact structure of the input?

Example

> Hello my name is John. This is another sentence.

Detect the sentences.

> Hello my name is John.
> This is another sentence.

Tokenize the sentences.

> Hello 
> my 
> name 
> is 
> John.
> 
> This 
> is 
> another 
> sentence.

So far, the code that analyzes the tokens above looks like this.

    // Load the trained name finder model (modelIn3 is an open model input stream)
    TokenNameFinderModel model3 = new TokenNameFinderModel(modelIn3);
    NameFinderME nameFinder = new NameFinderME(model3);

    List<Span[]> spans = new List<Span[]>();
    foreach (string sentence in sentences)
    {
        // Tokenize the sentence, then find the name spans over the token array
        String[] tokens = tokenizer.tokenize(sentence);

        Span[] nameSpans = nameFinder.find(tokens);
        string[] namedEntities = Span.spansToStrings(nameSpans, tokens);

        // I want to modify each of the named entities found
        // (one possible approach is sketched after this block)
        //foreach (string s in namedEntities) { modifystring(s); }

        spans.Add(nameSpans);
    }
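
For the modification step in the commented-out lines above, one option is to overwrite the tokens covered by each returned Span rather than the strings from spansToStrings, since each Span holds the start (inclusive) and end (exclusive) token indices of a name. A minimal Java sketch (OpenNLP's native API rather than the C# binding used above; maskTokens and the mask value are illustrative names, not part of OpenNLP):

    import opennlp.tools.util.Span;

    public class MaskNames {

        // Replace every token covered by a name span with a mask string.
        static String[] maskTokens(String[] tokens, Span[] nameSpans, String mask) {
            String[] masked = tokens.clone();
            for (Span span : nameSpans) {
                // Span.getStart() is inclusive, Span.getEnd() is exclusive (token indices)
                for (int i = span.getStart(); i < span.getEnd(); i++) {
                    masked[i] = mask;
                }
            }
            return masked;
        }
    }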

The desired output, perhaps with the names that were found masked:

> Hello my name is XXXX. This is another sentence.
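
One way to get exactly this output without detokenizing at all is to tokenize with the tokenizer's tokenizePos method, which returns character-level Spans into the original sentence. The token-index spans returned by the name finder can then be mapped back to character offsets and the replacement done on the original string, so the original spacing and punctuation are preserved exactly. A minimal Java sketch along those lines (model loading and tokenizer setup omitted; maskSentence and the mask parameter are illustrative names):

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.util.Span;

    public class MaskInOriginalText {

        // Mask every name found by the name finder directly in the original
        // sentence, keeping its exact spacing and punctuation.
        static String maskSentence(String sentence, Tokenizer tokenizer,
                                   NameFinderME nameFinder, String mask) {
            // Character-level spans of each token within the original sentence
            Span[] tokenSpans = tokenizer.tokenizePos(sentence);
            String[] tokens = Span.spansToStrings(tokenSpans, sentence);

            // Name spans are expressed as token indices (start inclusive, end exclusive)
            Span[] nameSpans = nameFinder.find(tokens);

            StringBuilder sb = new StringBuilder(sentence);
            // Replace from the last name to the first so earlier offsets stay valid
            for (int i = nameSpans.length - 1; i >= 0; i--) {
                int charStart = tokenSpans[nameSpans[i].getStart()].getStart();
                int charEnd = tokenSpans[nameSpans[i].getEnd() - 1].getEnd();
                sb.replace(charStart, charEnd, mask);
            }
            return sb.toString();
        }
    }

Applied to the example above, this would turn "Hello my name is John." into "Hello my name is XXXX." while leaving the rest of the sentence untouched.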

In the documentation there is a link to the following post, which describes how to use the detokenizer. I don't understand how the operations array relates to the original tokenization (if it relates to it at all).

https://issues.apache.org/jira/browse/OPENNLP-216

1. Create an instance of SimpleTokenizer and tokenize the sentence using its tokenize(String str) method.

        String sentence = "He said \"This is a test\".";
        SimpleTokenizer instance = SimpleTokenizer.INSTANCE;
        String tokens[] = instance.tokenize(sentence);

2. The operations array must contain one operation name per token, i.e. both arrays must have the same length. Store the operation name N times (tokens.length times) into the operations array.

        Operation operations[] = new Operation[tokens.length];
        String oper = "MOVE_RIGHT"; // please refer to the list above for the list of operations
        for (int i = 0; i < tokens.length; i++) {
            operations[i] = Operation.parse(oper);
        }
        System.out.println(operations.length);

   Here the length of the operations array will be equal to the length of the tokens array.

3. Now create an instance of DetokenizationDictionary by passing the tokens and operations arrays to the constructor.

        DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);

4. Pass the DetokenizationDictionary instance to the DictionaryDetokenizer class to detokenize the tokens. DictionaryDetokenizer.detokenize requires two parameters: a) the tokens array and b) a split marker.

        DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);
        String st = dictDetokenize.detokenize(tokens, " ");

Output:
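
On the question of how the operations relate to the original tokenization: as far as I understand it, the DetokenizationDictionary is not built from the sentence's own tokens at all. It maps particular token strings (typically punctuation) to a rule that says how that token should be attached when the tokens are glued back together; tokens not in the dictionary are simply joined with a space. A small Java sketch of a hand-built dictionary, assuming the Operation values defined in current OpenNLP releases (OpenNLP also ships a ready-made English detokenizer dictionary as an XML file that can be loaded instead of building one by hand):

    import opennlp.tools.tokenize.DetokenizationDictionary;
    import opennlp.tools.tokenize.DetokenizationDictionary.Operation;
    import opennlp.tools.tokenize.DictionaryDetokenizer;
    import opennlp.tools.tokenize.SimpleTokenizer;

    public class DetokenizerExample {

        public static void main(String[] args) {
            // Each entry says how a specific token attaches to its neighbours:
            // "." and "," merge to the left (no space before them), while double
            // quotes alternate between attaching to the right and to the left.
            String[] dictTokens = {".", ",", "\""};
            Operation[] operations = {
                    Operation.MERGE_TO_LEFT,
                    Operation.MERGE_TO_LEFT,
                    Operation.RIGHT_LEFT_MATCHING
            };
            DetokenizationDictionary dict = new DetokenizationDictionary(dictTokens, operations);
            DictionaryDetokenizer detokenizer = new DictionaryDetokenizer(dict);

            String[] tokens = SimpleTokenizer.INSTANCE.tokenize("He said \"This is a test\".");
            // Tokens that are not in the dictionary are joined with a single space
            System.out.println(detokenizer.detokenize(tokens, null));
        }
    }

With a dictionary like this, the example sentence should come back as He said "This is a test". rather than with spaces around the quotes and the full stop.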

1 answer:

Answer 0 (score: 0)

Use the Detokenizer:

    String text = detokenize(myTokens, null);
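
Putting this answer together with the name finder, a plausible end-to-end flow would be: tokenize, find the name spans, overwrite the covered tokens, then hand the modified token array to a Detokenizer. A hedged Java sketch (myTokens and the mask value "XXXX" are illustrative; the two-argument detokenize is the method the answer calls):

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.tokenize.Detokenizer;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.util.Span;

    public class MaskAndDetokenize {

        // Tokenize, mask the name spans, and rebuild a sentence with the detokenizer.
        static String maskAndDetokenize(String sentence, Tokenizer tokenizer,
                                        NameFinderME nameFinder, Detokenizer detokenizer) {
            String[] myTokens = tokenizer.tokenize(sentence);
            Span[] nameSpans = nameFinder.find(myTokens);

            // Overwrite every token covered by a name span with the mask
            for (Span span : nameSpans) {
                for (int i = span.getStart(); i < span.getEnd(); i++) {
                    myTokens[i] = "XXXX";
                }
            }

            // A null split marker joins merged tokens directly and the rest with spaces
            return detokenizer.detokenize(myTokens, null);
        }
    }

Note that a dictionary-based detokenizer restores conventional spacing rules rather than the literal whitespace of the input, so if the output has to match the input character for character, the tokenizePos approach sketched earlier in the question is the safer route.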