Extracting repeated substrings from a list or body of text

Time: 2018-08-07 16:52:38

Tags: nlp text-extraction

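Yes, this question has been asked in various forms over the years, but there seem to be many variables that affect the best approach depending on the use case.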

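We have a database of web page titles, many of which contain repeated strings such as the website name, the website section, or both. We are trying to extract the most frequently repeated phrases to build a dictionary, which will then let us remove those substrings in a separate process.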

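The analysis is typically run against 10,000 lines of text, each with a maximum length of 256 characters. We would also like the substrings to be delimited by special characters such as "-", "|" or ":".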

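The solutions we have seen include regular expressions, suffix arrays and suffix trees, but we are not sure which approach would be most efficient for our data. This calculation needs to run thousands of times a day, each time against a unique list.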

Here is an example of a list:

Sports lottery sales soar 70% in June on FIFA World Cup | Society | FOCUS TAIWAN - CNA ENGLISH NEWS
Scorching heat forecast to continue Tuesday | Society | FOCUS TAIWAN - CNA ENGLISH NEWS
Tech startups eye Taiwan's market | video | FOCUS TAIWAN - CNA ENGLISH NEWS
Taiwan headline news | Society | FOCUS TAIWAN - CNA ENGLISH NEWS
About 30% of working fathers feel alienated from children: poll | Society | FocusTaiwan Mobile - CNA English News
Taiwan wins 2 championships in Pony League baseball world series | Entertainment & Sports | FocusTaiwan Mobile - CNA English News
A smart way to escape in a fire | video | FOCUS TAIWAN - CNA ENGLISH NEWS
Taiwan headline news | What the Papers Say | FOCUS TAIWAN - CNA ENGLISH NEWS
Taiwanese co-authors article nominated by European publisher | Society | FOCUS TAIWAN - CNA ENGLISH NEWS
Taiwan to help Indonesia with post-earthquake relief: MOFA | Politics | FOCUS TAIWAN - CNA ENGLISH NEWS
Taiwan shares close down 0.37% | Economics | FocusTaiwan Mobile - CNA English News
Taiwan shares open higher | Economics | FocusTaiwan Mobile - CNA English News
U.S. dollar closes lower on Taipei forex market | Economics | FocusTaiwan Mobile - CNA English News

From this list we would like to receive the following data, including the punctuation or delimiters:

| FOCUS TAIWAN - CNA ENGLISH NEWS = 8 Occurrences
| FocusTaiwan Mobile - CNA English News = 5 Occurrences 
| Society | FOCUS TAIWAN - CNA ENGLISH NEWS = 4 Occurrences 
| Economics | FocusTaiwan Mobile - CNA English News = 3 Occurrences

And so on... Any suggestions on the most appropriate approach to investigate would be welcome.

*Sample analysis data extracted from https://www.online-utility.org/text/analyzer.jsp
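
For illustration only, here is a minimal Python sketch of one possible way to get counts like the ones above. It is not one of the approaches named in the question (suffix arrays or suffix trees); it relies on two assumptions drawn from the sample: the repeated phrases sit at the end of each title, and every phrase starts at a delimiter surrounded by whitespace. The names DELIMITER, count_delimited_suffixes and min_count are hypothetical, chosen just for this example.

```python
import re
from collections import Counter

# Assumed delimiters: "|" or "-" with whitespace on both sides, or ":" followed
# by whitespace. Hyphens inside words (e.g. "post-earthquake") are not matched.
DELIMITER = re.compile(r"\s[|\-]\s|:\s")


def count_delimited_suffixes(titles, min_count=2):
    """Count every trailing substring of a title that starts at a delimiter.

    Returns (substring, count) pairs, most frequent first, keeping only
    substrings seen at least min_count times. Matching is case-sensitive.
    """
    counts = Counter()
    for title in titles:
        for match in DELIMITER.finditer(title):
            # Keep the delimiter itself, drop the whitespace in front of it.
            counts[title[match.start():].strip()] += 1
    return [(s, n) for s, n in counts.most_common() if n >= min_count]
```

Each title is scanned once and contributes one candidate per delimiter, so the work grows with the total number of characters. Two caveats: shorter tails such as "- CNA ENGLISH NEWS" are reported alongside the longer phrases and may need filtering, and "CNA ENGLISH NEWS" and "CNA English News" are counted separately because matching is case-sensitive. Phrases repeated in the middle of titles would not be found this way.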

0 answers:

No answers