Question

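I am using the MyISAM engine with a full-text index to store a list of strings.

These strings can be single words or whole sentences.

If I want to know how many times the string hello appears in my table, I run: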

SELECT COUNT(*) Total 
    FROM String s
WHERE
    MATCH (s.name) AGAINST ('hello')

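I would like to create a similar report, but for all strings. The result should be a TOP-N list of the most frequent strings in this table (the most frequent will probably be "the", "a", "to", etc.).

The exact-match case is straightforward: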

SELECT name as String, count(*) as Total
    FROM String
GROUP 
    BY name
ORDER
    BY total desc
LIMIT *some number*

But this only counts whole strings, not the individual words inside them.

Is there a way to get the result I want?

Thanks.

Answer 1

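I don't think there is an easy way to do this. I would simply create a "statistics table" for this purpose: one column for the word itself and one column for the number of occurrences. (With the first column as the primary key, of course.)

To populate it, use a PL/SQL block (in MySQL, that would be a stored procedure) to scan all the strings and split them into words. If a word is not found in the statistics table, insert a new row. If it is found, increment the value in the second column.

This may run for a long time, but once the first run is complete you only need to process new strings as they are inserted, possibly with a trigger. (This assumes you want the report regularly rather than just once.)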
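The statistics-table idea can be sketched roughly as follows. This is only an illustration under some loud assumptions: it uses Python with an in-memory SQLite database as a stand-in for MySQL, and the names `word_stats`, `build_word_stats`, `add_string`, and `top_words` are hypothetical. In MySQL, the insert-or-increment step would more likely be a single `INSERT ... ON DUPLICATE KEY UPDATE` statement.

```python
import re
import sqlite3


def add_string(conn, name):
    """Split one string into words and insert-or-increment each word's counter.

    This is the step a trigger would perform for each newly inserted string.
    """
    for word in re.findall(r"[a-z]+", name.lower()):
        # Create the row with a zero count if the word is new...
        conn.execute(
            "INSERT OR IGNORE INTO word_stats (word, total) VALUES (?, 0)", (word,)
        )
        # ...then increment it in either case.
        conn.execute(
            "UPDATE word_stats SET total = total + 1 WHERE word = ?", (word,)
        )


def build_word_stats(conn):
    """One-off full scan: rebuild word_stats from every row of the String table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_stats ("
        "word TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    conn.execute("DELETE FROM word_stats")
    for (name,) in conn.execute('SELECT name FROM "String"').fetchall():
        add_string(conn, name)
    conn.commit()


def top_words(conn, n):
    """Return the n most frequent words as (word, total) pairs."""
    return conn.execute(
        "SELECT word, total FROM word_stats ORDER BY total DESC, word LIMIT ?",
        (n,),
    ).fetchall()


# Sample data (made up for the example).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "String" (name TEXT)')
conn.executemany(
    'INSERT INTO "String" (name) VALUES (?)',
    [("hello world",), ("Hello there",), ("the world",)],
)
build_word_stats(conn)
print(top_words(conn, 2))  # [('hello', 2), ('world', 2)]
```

After the initial scan, calling `add_string` for each new row keeps the counts current without rescanning the whole table.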

Hope this helps; I don't have a simpler answer.

Answer 2

I think it will work if you use the LIKE operator:

select name, count(*) as total from String where name like '%hello%' group by name order by total
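To see what a query of this shape returns, here is a small sketch using Python with an in-memory SQLite database as a stand-in for MySQL (an assumption; the sample rows are made up, and an `ORDER BY total DESC, name` is used so the output is deterministic). It suggests two things to keep in mind: `LIKE '%hello%'` is a substring match, so it also picks up rows such as "Othello", and the `GROUP BY name` still counts whole strings rather than individual words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "String" (name TEXT)')
conn.executemany(
    'INSERT INTO "String" (name) VALUES (?)',
    [("hello world",), ("hello world",), ("Othello",), ("say hello",), ("goodbye",)],
)

# Same shape as the query above, with a deterministic sort order added.
rows = conn.execute(
    """
    SELECT name, COUNT(*) AS total
    FROM "String"
    WHERE name LIKE '%hello%'
    GROUP BY name
    ORDER BY total DESC, name
    """
).fetchall()

print(rows)  # [('hello world', 2), ('Othello', 1), ('say hello', 1)]
```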

Let me know if it works.

Answer 3

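I did not find any solution using SQL and my full-text index, but I managed to get the result I wanted by fetching all the strings from the DB and processing them on the back end with PHP: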

//get all strings from DB
$queryResult = $db->query("SELECT name as String FROM String");

//Combine all of them into an array (initialized first so an empty table does not break implode())
$stringArray = [];
while($row = $queryResult->fetch_array(MYSQLI_ASSOC)) {
    $stringArray[] = $row['String'];
}

//"Glue" all these strings into one huge string
$text = implode(" ", $stringArray);

//Make everything lowercase
$textLowercase = strtolower($text);

//Find all words
$result = preg_split('/[^a-z]/', $textLowercase, -1, PREG_SPLIT_NO_EMPTY);

//Filter some unwanted words
$result = array_filter($result, function($x){
    return !preg_match("/^(.|the|and|of|to|it|in|or|is|a|an|not|are)$/",$x);
});

//Count the number of occurrences of each word
$result = array_count_values($result);

//Sort
arsort($result);

//Select TOP-N strings, where N is $amount
$result = array_slice($result, 0, $amount);
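For comparison, the same pipeline can be sketched in Python (not the language of the answer above; this is only an illustration). It counts words row by row instead of gluing everything into one large string, which may matter for big tables. The function name `top_words` and the sample rows are hypothetical; the stop-word list mirrors the PHP filter, including dropping single-character words.

```python
import re
from collections import Counter

# Same stop words as the PHP filter above; single-letter words are dropped too.
STOP_WORDS = {"the", "and", "of", "to", "it", "in", "or", "is", "a", "an", "not", "are"}


def top_words(strings, amount):
    """Return the `amount` most frequent words across all strings."""
    counts = Counter()
    for s in strings:
        # Lowercase, then keep runs of a-z as words (like the preg_split step).
        for word in re.findall(r"[a-z]+", s.lower()):
            if len(word) > 1 and word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(amount)


# Sample rows standing in for the result of "SELECT name FROM String".
rows = ["Hello world", "hello to the world", "A quiet world"]
print(top_words(rows, 2))  # [('world', 3), ('hello', 2)]
```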

Find the most popular strings in a table
