使用NameFinderME查找一系列标记中的名称后,我想反转标记化并使用已修改的名称重建原始文本。有没有办法可以按照执行的确切方式反转标记化操作,以便输出是输入的确切结构?
实施例
您好,我的名字是约翰。这是另一句话。
查找句子
你好我的名字是约翰 这是另一句话。
Tokenize句子。
> Hello
> my
> name
> is
> John.
>
> This
> is
> another
> sentence.
到目前为止,分析上述令牌的代码看起来像这样。
TokenNameFinderModel model3 = new TokenNameFinderModel(modelIn3);
NameFinderME nameFinder = new NameFinderME(model3);
List<Span[]> spans = new List<Span[]>();
foreach (string sentence in sentences)
{
String[] tokens = tokenizer.tokenize(sentence);
Span[] nameSpans = nameFinder.find(tokens);
string[] namedEntities = Span.spansToStrings(nameSpans, tokens);
//I want to modify each of the named entities found
//foreach(string s in namedEntities) { modifystring(s) };
spans.Add(nameSpans);
}
所需的输出,可能掩盖了找到的名称。
您好我的名字是XXXX。这是另一句话。
在文档中,有一篇指向该帖子的链接,描述了如何使用detokenizer。我不明白操作数组如何与原始标记化(如果有的话)相关
https://issues.apache.org/jira/browse/OPENNLP-216
Create instance of SimpleTokenizer.
String sentence = "He said \"This is a test\".";
SimpleTokenizer instance = SimpleTokenizer.INSTANCE;
Tokenize the sentence using tokenize(String str) method from SimpleTokenizer
String tokens[] = instance.tokenize(sentence);
The operations array must have the same number of operation name as tokens array. Basically array length should be equal.
Store the operation name N-times (tokens.length times) into operation array.
Operation operations[] = new Operation[tokens.length];
String oper = "MOVE_RIGHT"; // please refer above list for the list of operations
for (int i = 0; i < tokens.length; i++)
{ operations[i] = Operation.parse(oper); }
System.out.println(operations.length);
Here the operation array length will be equal to the tokens array length.
Now create an instance of DetokenizationDictionary by passing tokens and operations arrays to the constructor.
DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);
Pass DetokenizationDictionary instance to the DictionaryDetokenizer class to detokenize the tokens.
DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);
DictionaryDetokenizer.detokenize requires two parameters. a). tokens array and b). split marker
String st = dictDetokenize.detokenize(tokens, " ");
Output: