Question

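I'm trying to search within a string, but the string can vary. My data comes from a CSV file, and the user types a single line of text.

Example: (this is a cell in the CSV data) i want grocery shop/bakery/coffee shop/ pizza shop/ burger shop, (this is the user input) i crave grocery shop

So the user can build the phrase with grocery shop or any of the other options above, depending on what he hears in the audio file.

But he decided to write "crave" instead of "want".

Note: the user doesn't know the data I have, because he only has access to the audio, and he transcribes what he hears in his own way. I can't force him to use a particular form or write a particular phrase.

Is there a way to compare the user's input against my data and accept it as valid, since at the end of the day both have the same meaning?

I tried if (data.includes(userPhrase)), but it returns false.

The data is stored in a CSV file. I read it and store it in an Object where each group of phrases is grouped under a specific key (the phrase type), and the user data is a plain string.

How can I partially compare these two strings?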

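Update

I also tried Levenshtein similarity, but the numbers I get are very small: the phrase most similar to the user input scores 0.18, so I can't use it as a cutoff for deciding whether two strings are similar.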

Answer 1

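Comparing two sentences is indeed a complex problem. What you want is some kind of stringSimilarity function that returns a value between 0 and 1 indicating how similar the two strings are. I'm using this one, which scores the overlap of two-character sequences (bigrams) between the strings: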

function compareTwoStrings(first, second) {
  first = first.replace(/\s+/g, '').toLowerCase();
  second = second.replace(/\s+/g, '').toLowerCase();

  if (!first.length && !second.length) return 1; // if both are empty strings
  if (!first.length || !second.length) return 0; // if only one is empty string
  if (first === second) return 1; // identical
  if (first.length < 2 || second.length < 2) return 0; // a bigram needs at least 2 characters

  let firstBigrams = new Map();
  for (let i = 0; i < first.length - 1; i++) {
    const bigram = first.substring(i, i + 2);
    const count = firstBigrams.has(bigram) ? firstBigrams.get(bigram) + 1 : 1;

    firstBigrams.set(bigram, count);
  }

  let intersectionSize = 0;
  for (let i = 0; i < second.length - 1; i++) {
    const bigram = second.substring(i, i + 2);
    const count = firstBigrams.has(bigram) ? firstBigrams.get(bigram) : 0;

    if (count > 0) {
      firstBigrams.set(bigram, count - 1);
      intersectionSize++;
    }
  }

  return (2.0 * intersectionSize) / (first.length + second.length - 2);
}

function findBestMatch(mainString, targetStrings) {
  if (!areArgsValid(mainString, targetStrings)) throw new Error('Bad arguments: First argument should be a string, second should be an array of strings');

  const ratings = [];
  let bestMatchIndex = 0;

  for (let i = 0; i < targetStrings.length; i++) {
    const currentTargetString = targetStrings[i];
    const currentRating = compareTwoStrings(mainString, currentTargetString);
    ratings.push({ target: currentTargetString, rating: currentRating });
    if (currentRating > ratings[bestMatchIndex].rating) {
      bestMatchIndex = i;
    }
  }

  const bestMatch = ratings[bestMatchIndex];

  return { ratings, bestMatch, bestMatchIndex };
}

function areArgsValid(mainString, targetStrings) {
  if (typeof mainString !== 'string') return false;
  if (!Array.isArray(targetStrings)) return false;
  if (!targetStrings.length) return false;
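  // some() rather than find(): find() returns the element itself, so a falsy non-string (0, null) would slip through
  if (targetStrings.some(s => typeof s !== 'string')) return false;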
  return true;
}

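Let's break it down: findBestMatch takes two arguments, a string and an array of strings (in your case, the user input and the strings from the CSV). It then runs compareTwoStrings against each string in the array to compute how similar they are. It returns the ratings, the bestMatch, and its index. You can then decide on a threshold for accepting a match. In your case, running findBestMatch on the user input and the given CSV line gives a score of 0.28, which is far off. However, if your CSV text were split into separate phrases, such as i want grocery shop, i want bakery shop, and so on, the score would be 0.65, which is acceptable.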
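As a rough illustration of the threshold idea, here is a minimal self-contained sketch. It assumes the CSV cell has already been split into one phrase per option. The phrases array, the matchPhrase helper, and the 0.6 cutoff are all hypothetical choices made for this example, and diceSimilarity is a condensed version of the same bigram comparison as compareTwoStrings.

```javascript
// Bigram overlap score between 0 and 1; whitespace and case are ignored.
function diceSimilarity(a, b) {
  const x = a.replace(/\s+/g, '').toLowerCase();
  const y = b.replace(/\s+/g, '').toLowerCase();
  if (x === y) return 1;
  if (x.length < 2 || y.length < 2) return 0;

  // Count every bigram of the first string.
  const counts = new Map();
  for (let i = 0; i < x.length - 1; i++) {
    const bigram = x.substring(i, i + 2);
    counts.set(bigram, (counts.get(bigram) || 0) + 1);
  }

  // Consume matching bigrams while walking the second string.
  let shared = 0;
  for (let i = 0; i < y.length - 1; i++) {
    const bigram = y.substring(i, i + 2);
    const left = counts.get(bigram) || 0;
    if (left > 0) {
      counts.set(bigram, left - 1);
      shared++;
    }
  }
  return (2 * shared) / (x.length + y.length - 2);
}

// Assumption: the CSV cell was expanded by hand into one phrase per option.
const phrases = [
  'i want grocery shop',
  'i want bakery shop',
  'i want coffee shop',
  'i want pizza shop',
  'i want burger shop',
];

const THRESHOLD = 0.6; // assumed cutoff; tune it against real transcriptions

// Hypothetical helper: returns the best candidate and whether it clears the cutoff.
function matchPhrase(userInput, candidates, threshold) {
  let best = { target: null, rating: -1 };
  for (const target of candidates) {
    const rating = diceSimilarity(userInput, target);
    if (rating > best.rating) best = { target, rating };
  }
  return { ...best, accepted: best.rating >= threshold };
}

const result = matchPhrase('i crave grocery shop', phrases, THRESHOLD);
console.log(result.target, result.rating.toFixed(2), result.accepted);
// logs: i want grocery shop 0.65 true
```

Note that this compares characters, not meaning: "crave" and "want" only pass because the rest of the phrase overlaps heavily, so shorter phrases will need a lower cutoff or a different technique.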

Answer 2

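If you're open to using a third-party tool, Elasticsearch can give you the results you need. It supports what are called full-text queries. This means it takes the string from your user input and fuzzily compares it against all the existing lines of text you've defined in the CSV.

Of course, this requires you to a) install Elasticsearch and b) submit all the existing lines of text up front. But it lets you do free-form searches and ranks the results by score (i.e. the closest match comes first).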
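The two steps could look roughly like the sketch below, which talks to the Elasticsearch REST API directly. It assumes a recent Elasticsearch version running locally without authentication at http://localhost:9200, and a runtime with a global fetch (for example Node 18 or later). The index name phrases, the field name text, and the three function names are hypothetical.

```javascript
const ES_URL = 'http://localhost:9200'; // assumption: local node, no authentication
const INDEX = 'phrases';                // hypothetical index name

// Build the body of a full-text `match` query on the hypothetical `text` field.
// fuzziness: 'AUTO' tolerates small typos in individual words.
function buildSearchBody(userInput) {
  return {
    size: 3, // only the top three hits
    query: {
      match: {
        text: { query: userInput, fuzziness: 'AUTO' },
      },
    },
  };
}

// Step b): submit one existing line of text to the index.
async function indexPhrase(phrase) {
  const res = await fetch(`${ES_URL}/${INDEX}/_doc?refresh=true`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: phrase }),
  });
  if (!res.ok) throw new Error(`Indexing failed with status ${res.status}`);
}

// Search with the user's free-form input; hits come back ranked by score.
async function searchPhrases(userInput) {
  const res = await fetch(`${ES_URL}/${INDEX}/_search`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSearchBody(userInput)),
  });
  if (!res.ok) throw new Error(`Search failed with status ${res.status}`);
  const data = await res.json();
  return data.hits.hits.map(hit => ({ text: hit._source.text, score: hit._score }));
}
```

With the phrases indexed, calling searchPhrases('i crave grocery shop') would be expected to rank the grocery phrase first because of the shared words. Fuzziness helps with misspellings, not synonyms, so "crave" itself would not match "want".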

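I've linked the guide page that describes the basic search functionality. The first chapter of the Getting Started guide ("You Know, for Search") also gives a very clear walkthrough of the end-to-end process.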

https://www.elastic.co/guide/en/elasticsearch/guide/current/search.html

Partial comparison between 2 strings
