Multiple text values in one column: need a query to find the most repeated words

Time: 2013-10-15 23:28:39

Tags: mysql

I have a column that stores each user's bio/title. It is free-form text written by the user and can contain any number of words.

id title
1  Business Development Executive Cold Calling & Cold Emailing expert Entrepreneur
2  Director of Online Marketing and entrepreneur
3  Art Director and Entrepreneur 
4  Corporate Development at Yahoo!
5  Snr Program Manager, Yahoo 

I'm trying to come up with a MySQL query that shows the word frequencies:

Entrepreneur 3
development  2
director     2 

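I know that if I could return each word in the value as a separate row, I could then use a normal GROUP BY. I've looked, but couldn't find a function that splits text into words on separate rows.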

Can this be done?

2 Answers:

Answer 0: (score: 4)

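You can do this by joining to a manufactured number series that is used to pick out the nth word. Unfortunately, MySQL has no built-in way of generating a series, so it's a bit ugly, but here it is: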

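select
  -- keep the first num words, then take the last of those: the num-th word
  substring_index(substring_index(title, ' ', num), ' ', -1) word,
  count(*) count
from job j
-- hard-coded number series 1..12, one row per possible word position
join (select 1 num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10 union select 11 union select 12) n
-- (number of spaces in title) >= num - 1, i.e. the title has at least num words
on length(title) >= length(replace(title, ' ', '')) + num - 1
group by 1       -- group by the extracted word
order by 2 desc  -- most frequent words first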

See a live demo on SQLFiddle that uses your data and produces the expected output.

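Sadly, having to hard-code every value of the number series also caps the number of words per column value that will be processed (12 in this case). Having too many numbers in the series does no harm, so you can always add more to cover longer expected input text.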
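
To see why the nested substring_index calls pick out the nth word, here is a rough Python sketch of the same logic. The helper substring_index is a hypothetical re-implementation written for this illustration (it is not a MySQL or Python API), word_counts and max_words are made-up names, and the lower() call is an assumption standing in for a case-insensitive column collation in MySQL.

```python
from collections import Counter


def substring_index(text: str, delim: str, count: int) -> str:
    """Hypothetical stand-in for MySQL's SUBSTRING_INDEX (non-empty delimiter only)."""
    if count == 0:
        return ""
    parts = text.split(delim)
    if count > 0:
        # everything to the left of the count-th delimiter
        return delim.join(parts[:count])
    # negative count: everything to the right of the count-th delimiter from the end
    return delim.join(parts[count:])


titles = [
    "Business Development Executive Cold Calling & Cold Emailing expert Entrepreneur",
    "Director of Online Marketing and entrepreneur",
    "Art Director and Entrepreneur",
    "Corporate Development at Yahoo!",
    "Snr Program Manager, Yahoo",
]


def word_counts(rows, max_words=12):
    """Mimic the join against the number series 1..max_words, then the GROUP BY."""
    counts = Counter()
    for title in rows:
        for num in range(1, max_words + 1):
            # same test as the ON clause: the title must contain at least num words
            if len(title) >= len(title.replace(" ", "")) + num - 1:
                # first num words, then the last of those = the num-th word
                word = substring_index(substring_index(title, " ", num), " ", -1)
                # assumption: lower-casing imitates a case-insensitive collation
                counts[word.lower()] += 1
    return counts


print(word_counts(titles).most_common(3))
```

Because the split is on single spaces only, punctuation stays attached to the word, so "Yahoo!" and "Yahoo" are counted as two different words here, and a standalone "&" counts as a word too.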

Answer 1: (score: 0)

Try selecting all the job titles and returning them as an array. Then do something like this in PHP:

<?php
$array = array("Business Development Executive Cold Calling & Cold Emailing expert  Entrepreneur ", "Director of Online Marketing and entrepreneur", "Art Director and Entrepreneur", "Corporate Development at Yahoo!", "Snr Program Manager, Yahoo");
$words = "";
foreach($array as $val) $words .= " ".strtolower($val);
print_r(array_count_values(str_word_count($words, 1)));
?>

This will output:

Array ( [business] => 1 [development] => 2 [executive] => 1 [cold] => 2 [calling] => 1 [emailing] => 1 [expert] => 1 [entrepreneur] => 3 [director] => 2 [of] => 1 [online] => 1 [marketing] => 1 [and] => 2 [art] => 1 [corporate] => 1 [at] => 1 [yahoo] => 2 [snr] => 1 [program] => 1 [manager] => 1 )