Find a person referenced in a document

Time: 2015-09-06 21:05:10

Tags: javascript database node.js postgresql full-text-search

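Let's say I have:

  • a database of 13,000 person entries, each with first name, name, birthday, street, zip code, city

  • a long text which contains the personal data of one specific person. Because it was processed by OCR, it can contain spelling errors

Here is an example of such a text: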

  Harry Potter, born 25.03.1995, resident at Jahnstreet 43, London is a series of seven fantasy novels written by British author J. K. Rowling. The series chronicles the adventures of a young wizard, Harry Potter, the titular character, and his friends Ronald Weasley and Hermione Granger, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's quest to defeat the Dark wizard Lord Voldemort, who aims to become immortal, conquer the wizarding world, subjugate non-magical people, and destroy all those who stand in his way, especially Harry Potter. Since the release of the first novel, Harry Potter and the Philosopher's Stone, on 30 June 1997, the books have gained immense popularity, critical acclaim and commercial success worldwide.[2] The series has also had some share of criticism, including concern about the increasingly dark tone as the series progressed. As of May 2015, the books have sold more than 450 million copies worldwide, making the series the best-selling book series in history, and have been translated into 73 languages.[3][4] The last four books consecutively set records as the fastest-selling books in history, with the final installment selling roughly 11 million copies in the United States within the first 24 hours of its release. A series of many genres, including fantasy, coming of age and the British school story (with elements of mystery, thriller, adventureand romance), it has many cultural meanings and references.[5] According to Rowling, the main theme is death.[6] There are also many other themes in the series, such as prejudice and corruption.[7]

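Now I want to find the person in the database who is referenced in the document.

I have several ideas for how to do this, but I don't know which one gives the best results. Which way would you prefer? Any recommendations? Thanks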

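  1. I split the text into an array, then loop over every birthday in the database and search for it with JavaScript's text.search('25.03.1995'). When I get a hit, I check the next field, e.g. text.search('Harry'). If several fields hit, I have found the right record.

    • Pros: easy to implement, no database commands, pure JavaScript
    • Cons: if the OCR makes a mistake and reads e.g. Harly instead of Harry, I can't confirm the match. The same happens if the date format differs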
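  2. First I index the text with the help of a database. Then I take an approach similar to the first one and go through every column in the database, but now using the database's CONTAINS

    • Pros: faster, better results?
    • Cons: I need a database with good full-text search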
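  3. I split the text and search the database columns for every single word using SQL LIKE

    • Pros: I don't have to index the document; better than CONTAINS?
    • Cons: not as fast as a text index?

Thanks for your help with this.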

1 Answer:

Answer 0: (score: 1)

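I think that because of the OCR errors you will sometimes need to rank several possible matches, and 13,000 entries don't take much memory. So it is probably easier to go with the first approach and do it entirely in JS. You would have to try parsing the CSV.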
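A minimal sketch of the first approach with ranking, assuming the records are already loaded into an array of objects. The field names (`firstName`, `name`, `birthday`, `street`, `city`), the helpers `dateVariants` and `rankCandidates`, the `DD.MM.YYYY` storage format and the second sample record are all hypothetical, made up for illustration. Instead of requiring every field to match, it counts how many fields of each record appear in the text and sorts by that count, so a single OCR error lowers a record's score rather than eliminating it:

```javascript
// Hypothetical record shape: { firstName, name, birthday, street, city }.
// birthday is assumed to be stored as 'DD.MM.YYYY'.

// Produce a few spellings of the same date so a differing format still matches.
function dateVariants(birthday) {
  const [day, month, year] = birthday.split('.');
  return [
    `${day}.${month}.${year}`,
    `${day}/${month}/${year}`,
    `${year}-${month}-${day}`,
  ];
}

// Score every record by the number of its fields found verbatim in the text,
// drop records with no hit at all, and sort the rest best-first.
function rankCandidates(text, people) {
  const haystack = text.toLowerCase();
  return people
    .map((person) => {
      let score = 0;
      if (dateVariants(person.birthday).some((d) => haystack.includes(d))) {
        score += 1;
      }
      for (const field of ['firstName', 'name', 'street', 'city']) {
        const value = String(person[field] || '').toLowerCase();
        if (value && haystack.includes(value)) score += 1;
      }
      return { person, score };
    })
    .filter((candidate) => candidate.score > 0)
    .sort((a, b) => b.score - a.score);
}

// Made-up sample data; the second record exists only to show a non-match.
const people = [
  { firstName: 'Harry', name: 'Potter', birthday: '25.03.1995', street: 'Jahnstreet', city: 'London' },
  { firstName: 'Ronald', name: 'Weasley', birthday: '01.03.1980', street: 'Main Road', city: 'Devon' },
];
const text = 'Harry Potter, born 25.03.1995, resident at Jahnstreet 43, London ...';

const ranked = rankCandidates(text, people);
console.log(ranked[0].person.name, ranked[0].score);
```

Plain substring matching like this still misses misspelled values; it only makes the result degrade gracefully when one field fails to match.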

It depends on how bad the OCR is, I think. If it is bad, a full-text index might help.

You could also try the string distance functions from the natural module on npm.
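To illustrate how a string distance tolerates an OCR error such as Harly for Harry, here is a self-contained sketch of Levenshtein edit distance. It is hand-written so that it runs without dependencies; in practice you would probably use the distance functions that natural provides instead. The helper names `levenshtein` and `fuzzyContains` and the threshold of one edit are assumptions for illustration:

```javascript
// Dynamic-programming Levenshtein distance:
// insertions, deletions and substitutions each cost 1.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete from a
        curr[j - 1] + 1,    // insert into a
        prev[j - 1] + cost  // substitute (or keep if equal)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// True if any single word of the text is within maxEdits of the wanted value.
// Words are split on anything that is not a letter or digit.
function fuzzyContains(text, wanted, maxEdits = 1) {
  const target = wanted.toLowerCase();
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean)
    .some((word) => levenshtein(word, target) <= maxEdits);
}

console.log(levenshtein('Harly', 'Harry'));
console.log(fuzzyContains('Harly Potter, born 25.03.1995', 'Harry'));
console.log(fuzzyContains('Ronald Weasley', 'Harry'));
```

A fixed threshold of one edit will produce false positives on very short values, so it likely makes sense to scale `maxEdits` with the length of the word. Multi-word values such as a street name would also need to be compared word by word or against a sliding window, which this sketch does not do.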