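Find the frequency of the most commonly used words using SQL

Time: 2014-07-21 08:23:09

Tags: php mysql sql count

I'm using MySQL Workbench for my application (which involves JavaScript and PHP). I have a SQL database containing tweets. I want to query for the frequency of each word in the tweets (sentences). From my research, COUNT() seems to be the way to do this, but I still can't get the result I want.

Sample dataset: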

tweetsID |  Tweets                                           | DateTime
   1     | I can't wait to go to school tomorrow!            | 2014-07-18 12:00:00
   2     | My teacher saw me hanging out after school        | 2014-07-18 12:20:00
   3     | I had Pepper Lunch for my dinner                  | 2014-07-18 12:30:00
   4     | Something happened in my school omg               | 2014-07-18 12:40:00
   5     | This project is so hard!                          | 2014-07-18 12:50:00

Expected output:

Words     | frequency
I         | 2
can't     | 1
wait      | 1
to        | 2
school    | 3
tomorrow  | 1
!         | 2
my        | 3
had       | 1
teacher   | 1
saw       | 1
me        | 1
hanging   | 1
out       | 1
after     | 1
pepper    | 1
lunch     | 1
for       | 1
dinner    | 1
something | 1
happened  | 1
in        | 1
omg       | 1
this      | 1
project   | 1
is        | 1
so        | 1
hard      | 1

I've created the sample data at the following link:

[http://sqlfiddle.com/#!2/3b3f2/1]

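Can anyone teach me how to do this, or point me to a reference? Thanks in advance.

3 Answers:

Answer 0: (score: 3)

I think your best bet is to do this in PHP. array_count_values() comes to mind.

Try this: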

$sqlresults = array(
    "I can't wait to go to school tomorrow!",          
    "My teacher saw me hanging out after school",  
    "I had Pepper Lunch for my dinner",               
    "Something happened in my school omg",            
    "This project is so hard!"
);  

$arr = array();
foreach ($sqlresults as $str) {
    $arr = array_merge($arr, explode(' ', $str));    
}

$arr = array_count_values($arr);

print_r($arr);
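
Note that splitting on a single space is case-sensitive and leaves punctuation attached ("tomorrow!", "My" vs "my"), whereas the expected output in the question folds case and counts "!" on its own. Here is a minimal sketch of one way to get closer to that, assuming ASCII text. The count_words() helper and its regular expression are illustrative assumptions, not part of the original answer:

```php
<?php
// Hypothetical helper: tokenizes tweets case-insensitively and treats
// each punctuation mark as its own token. Assumes ASCII input.
function count_words(array $tweets): array
{
    $tokens = array();
    foreach ($tweets as $tweet) {
        // A word is a run of letters, digits or apostrophes; any other
        // non-space character (e.g. "!") becomes a separate token.
        preg_match_all("/[a-z0-9']+|[^\\sa-z0-9']/", strtolower($tweet), $matches);
        $tokens = array_merge($tokens, $matches[0]);
    }
    $counts = array_count_values($tokens);
    arsort($counts); // highest frequency first
    return $counts;
}

$tweets = array(
    "I can't wait to go to school tomorrow!",
    "My teacher saw me hanging out after school",
    "I had Pepper Lunch for my dinner",
    "Something happened in my school omg",
    "This project is so hard!",
);

$counts = count_words($tweets);
print_r($counts);
```

Because the result is sorted with arsort(), the most frequent tokens come first, so the top entries can be read with array_slice($counts, 0, 5, true).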




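Answer 1: (score: 0)

I'd say you need to restructure your database.

I would introduce a separate table, words (id, word), and a relation table, tweet_to_word (tweet_id, word_id, word_count), in which you store every word of each tweet.

After that it becomes a simple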

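-- SUM adds up the per-tweet counts; COUNT would only count the tweets containing each word
select ttw.word_id, sum(ttw.word_count) as frequency
from tweet_to_word ttw
group by ttw.word_id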

You can add an ORDER BY to the select to find the most popular words.
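
To make the proposed schema more concrete, here is a rough sketch of how the rows for words and tweet_to_word could be derived in PHP before inserting them. The build_word_rows() function, the tokenizer, and the sequential ids are assumptions for illustration only; a real implementation would let the database assign ids and write the rows with INSERT statements.

```php
<?php
// Sketch: derive rows for the proposed words and tweet_to_word tables.
// Function name and tokenizer are hypothetical. Assumes ASCII input.
function build_word_rows(array $tweetsById): array
{
    $words = array();       // word => word id
    $tweetToWord = array(); // rows of tweet_id, word_id, word_count
    foreach ($tweetsById as $tweetId => $text) {
        // Lowercase, then split on anything that is not a letter, digit or apostrophe.
        $tokens = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_count_values($tokens) as $word => $count) {
            $word = (string) $word; // purely numeric tokens come back as int keys
            if (!isset($words[$word])) {
                $words[$word] = count($words) + 1; // next sequential id
            }
            $tweetToWord[] = array(
                'tweet_id' => $tweetId,
                'word_id' => $words[$word],
                'word_count' => $count,
            );
        }
    }
    return array('words' => $words, 'tweet_to_word' => $tweetToWord);
}

$rows = build_word_rows(array(
    1 => "I can't wait to go to school tomorrow!",
    2 => "My teacher saw me hanging out after school",
));

// In-memory equivalent of totalling word_count per word_id with GROUP BY.
$totals = array();
foreach ($rows['tweet_to_word'] as $row) {
    $totals[$row['word_id']] = ($totals[$row['word_id']] ?? 0) + $row['word_count'];
}
print_r($totals);
```

The trade-off of this design is that the tokenizing happens once, at insert time, so the frequency query stays a plain aggregate over tweet_to_word.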

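Answer 2: (score: 0)

To demonstrate how messy this can get, the following does pretty much what you want in a single SQL statement.

First, punctuation is replaced with spaces, and then 2 spaces are replaced with 1 space (several times over). The idea is to end up with a string containing words separated by single spaces.

The number of words is then calculated by comparing the length of the string with its length after the spaces have been replaced with nothing, and adding 1.

Next, three subqueries that each return the digits 0 to 9 are cross joined, which gives 1000 rows per tweet together with the numbers 0 to 999. These numbers are used with SUBSTRING_INDEX to extract each individual word. The number (0 to 999) is compared with the number of words in the tweet to avoid repeating the last word.

The result is then just fed into a normal COUNT / GROUP BY to get the words and their counts.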

SELECT Words, COUNT(*) AS frequency
FROM
(
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(Tweets, ' ', 1 + units.i + tens.i * 10 + hundreds.i * 100), ' ', -1) AS Words
    FROM (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) units 
    CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) tens
    CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) hundreds
    CROSS JOIN
    (
        SELECT Tweets,
                (LENGTH(Tweets) - LENGTH(REPLACE(Tweets, ' ', ''))) + 1 AS Tweets_Words
        FROM
        (
            SELECT TRIM(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Tweets, '(', ' '), ')', ' '), ',', ' '), '.', ' '), ';', ' '), ':', ' '), '?', ' '), '!', ' '), '{', ' '), '}', ' '), '  ', ' '), '  ', ' '), '  ', ' '), '  ', ' ')) AS Tweets
            FROM some_tweets
        ) sub0
    ) sub1
    WHERE Tweets_Words > (units.i + tens.i * 10 + hundreds.i * 100)
) sub2
GROUP BY Words
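
If the nested SUBSTRING_INDEX calls and the number series are hard to follow, the same steps can be traced in PHP for a single, already cleaned tweet. This is only an illustration of the logic, not a replacement for the query. The nth_word() helper is a hypothetical stand-in for SUBSTRING_INDEX(SUBSTRING_INDEX(str, ' ', n), ' ', -1):

```php
<?php
// Hypothetical stand-in for SUBSTRING_INDEX(SUBSTRING_INDEX($s, ' ', $n), ' ', -1):
// keep the first $n space-separated words, then return the last of them.
function nth_word(string $s, int $n): string
{
    $head = implode(' ', array_slice(explode(' ', $s), 0, $n));
    $parts = explode(' ', $head);
    return end($parts);
}

$tweet = 'This project is so hard';

// Word count = number of spaces + 1, as in the SQL above.
$wordCount = strlen($tweet) - strlen(str_replace(' ', '', $tweet)) + 1;

$words = array();
for ($i = 0; $i < 1000; $i++) { // the numbers 0 to 999 from the cross joins
    if ($wordCount > $i) {      // WHERE Tweets_Words > number
        $words[] = nth_word($tweet, $i + 1);
    }
}
print_r($words);

// Without the guard, asking for a word past the end repeats the last one.
echo nth_word($tweet, 6), "\n";
```

The final line shows why the comparison against the word count is needed: a position beyond the last word still returns the last word, so the unguarded version would count it up to 1000 times per tweet.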

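The replacing of double spaces with a single space could probably be removed, and replaced with a check that the resulting word is not '':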

SELECT Words, COUNT(*) AS frequency
FROM
(
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(Tweets, ' ', 1 + units.i + tens.i * 10 + hundreds.i * 100), ' ', -1) AS Words
    FROM (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) units 
    CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) tens
    CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) hundreds
    CROSS JOIN
    (
        SELECT Tweets,
                (LENGTH(Tweets) - LENGTH(REPLACE(Tweets, ' ', ''))) + 1 AS Tweets_Words
        FROM
        (
            SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Tweets, '(', ' '), ')', ' '), ',', ' '), '.', ' '), ';', ' '), ':', ' '), '?', ' '), '!', ' '), '{', ' '), '}', ' ') AS Tweets
            FROM some_tweets
        ) sub0
    ) sub1
    WHERE Tweets_Words > (units.i + tens.i * 10 + hundreds.i * 100)
) sub2
WHERE Words != ''
GROUP BY Words

SQL fiddle here:

http://www.sqlfiddle.com/#!2/f28e5/1