Tool for normalizing source text and rebuilding the original source from the normalized text

Time: 2015-09-24 16:31:00

Tags: java text nlp normalization

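Does anyone know of a Java tool or project that can normalize text (and keep a log of the normalization) and then rebuild the original source text?

Any approach is appreciated.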

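Problem: to process the input data, we need to normalize it.

The processing engine receives the normalized text and returns the positions of the matches.

After this step, we need to map the normalized positions back to the equivalent spans in the original source.

Example: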

Source:
Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan euripidis in, eum liber hendrerit an ... ütf Wórd èxämplé

Normalized text (approx):
lorem ipsum scripta blandit partiendo, fastidi accumsan euripidis, liber hendrerit utf word example

Engine output:
lorem ipsum scripta begin 0 end 19
euripidis           begin 56 end 65

Original source equivalent:
Lorem ipsum ad his scripta begin 0 end 26
euripidis                  begin 68 end 77

Thanks for your help.

1 answer:

Answer 0: (score: 0)

The approach that worked best for us was to use a regex:

// Given
Source:
Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan euripidis in, eum liber hendrerit an ... ütf Wórd èxämplé

Stopwords:
ad, his, eum, in, an

ASCII text:
Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan euripidis in, eum liber hendrerit an ... utf Word example

Normalized text (approx):
lorem ipsum scripta blandit partiendo, fastidi accumsan euripidis, liber hendrerit utf word example

// Then
Engine output:
lorem ipsum scripta begin 0 end 19
euripidis           begin 56 end 65

To recover the original text from the normalized text, we used a regex that allows stopwords between the tokens:
lorem( (ad|his|eum|in|an))* ipsum( (ad|his|eum|in|an))* scripta
euripidis

// Verify

Original source equivalent:
Lorem ipsum ad his scripta begin 0 end 26
euripidis                  begin 68 end 77
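
The steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the exact code we ran: the class and method names (OriginalSpanFinder, foldToAscii, toSourcePattern, findOriginalSpan) are made up for the example. It assumes accented letters in the source are precomposed (one char each), so the ASCII-folded text has the same length as the source and match offsets carry over unchanged; the code refuses to continue if that does not hold.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OriginalSpanFinder {

    /** Strips diacritics: decompose to NFD, then drop the combining marks. */
    static String foldToAscii(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Builds e.g. "lorem( (ad|his))* ipsum" from a normalized phrase and the stopword list. */
    static Pattern toSourcePattern(String normalizedPhrase, List<String> stopwords) {
        List<String> quotedStops = new ArrayList<>();
        for (String stop : stopwords) {
            quotedStops.add(Pattern.quote(stop));
        }
        // Between two tokens: any number of " stopword" groups, then one space.
        String gap = "( (" + String.join("|", quotedStops) + "))* ";

        List<String> quotedTokens = new ArrayList<>();
        for (String token : normalizedPhrase.trim().split("\\s+")) {
            quotedTokens.add(Pattern.quote(token));
        }
        return Pattern.compile(String.join(gap, quotedTokens), Pattern.CASE_INSENSITIVE);
    }

    /** Returns {begin, end} of the first match at or after 'from' in the source, or null. */
    static int[] findOriginalSpan(String source, String normalizedPhrase,
                                  List<String> stopwords, int from) {
        String folded = foldToAscii(source);
        if (folded.length() != source.length()) {
            // Assumption violated: offsets in the folded text would not line up with the source.
            throw new IllegalArgumentException("ASCII folding changed the text length");
        }
        Matcher matcher = toSourcePattern(normalizedPhrase, stopwords).matcher(folded);
        return matcher.find(from) ? new int[] {matcher.start(), matcher.end()} : null;
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes to avoid source-encoding issues.
        String source = "Lorem ipsum ad his scripta blandit partiendo, eum fastidii accumsan "
                + "euripidis in, eum liber hendrerit an ... \u00fctf W\u00f3rd \u00e8x\u00e4mpl\u00e9";
        List<String> stopwords = Arrays.asList("ad", "his", "eum", "in", "an");

        int[] span = findOriginalSpan(source, "lorem ipsum scripta", stopwords, 0);
        System.out.println(source.substring(span[0], span[1])
                + " begin " + span[0] + " end " + span[1]);
        // prints: Lorem ipsum ad his scripta begin 0 end 26
    }
}
```

Note that this only undoes case folding, diacritic removal and stopword removal. If the normalizer also stems words ("fastidii" to "fastidi") or drops punctuation, the pattern has to be loosened accordingly, and when a phrase occurs more than once the 'from' argument has to be advanced to pick the right occurrence.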