Count the most common words in a table, filtering out stop words

Asked: 2014-11-08 01:15:28

Tags: mysql

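I created a table filled with the first responses that came to people's minds when they looked at a photo. I have ~1400 entries. Now I want to see what the most common descriptions are.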

CREATE TABLE descript (
wordID int NOT NULL AUTO_INCREMENT PRIMARY KEY,
wordText TEXT(50)
)
ENGINE=MyISAM;

INSERT INTO descript VALUES(0,"Big");
INSERT INTO descript VALUES(0,"blue");
INSERT INTO descript VALUES(0,"blue");
INSERT INTO descript VALUES(0,"fast");
INSERT INTO descript VALUES(0,"impressive");
INSERT INTO descript VALUES(0,"big");
INSERT INTO descript VALUES(0,"big");
INSERT INTO descript VALUES(0,"red");
INSERT INTO descript VALUES(0,"his");
INSERT INTO descript VALUES(0,"her");
INSERT INTO descript VALUES(0,"His");
INSERT INTO descript VALUES(0,"Black");
INSERT INTO descript VALUES(0,"black");
INSERT INTO descript VALUES(0,"black");
INSERT INTO descript VALUES(0,"blue");
INSERT INTO descript VALUES(0,"a black");
INSERT INTO descript VALUES(0,"his");
INSERT INTO descript VALUES(0,"her");
INSERT INTO descript VALUES(0,"pleasant");
INSERT INTO descript VALUES(0,"the fast");
INSERT INTO descript VALUES(0,"blue");

...with more rows like these before and after.

I need the text in lowercase, which I get with this:

SELECT LOWER(wordText) FROM descript;

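How can I count the most common answer and display it? There are some stop words that I don't want included in the count, e.g. 'a' or 'the'. How do I leave them out?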

3 Answers:

Answer 0: (score: 1)

The basic query is:

SELECT lower(wordText) as word, count(*)
FROM descript
GROUP BY lower(wordText)
ORDER BY count(*) DESC
LIMIT 1;

If you want to list the stop words in the query itself, you can filter them out with NOT IN:

SELECT lower(wordText) as word, count(*)
FROM descript
WHERE lower(wordText) not in ('a', 'the', . . . )
GROUP BY lower(wordText)
ORDER BY count(*) DESC
LIMIT 1;
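As a rough way to try the NOT IN query against the sample rows from the question, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite stands in for MySQL here, so the column types are simplified. The helper name `most_common_answer` and the two-word stop list are assumptions made for illustration. The stop words are passed as bound parameters rather than pasted into the SQL string.

```python
import sqlite3

# Sample answers copied from the question.
ROWS = ["Big", "blue", "blue", "fast", "impressive", "big", "big", "red",
        "his", "her", "His", "Black", "black", "black", "blue", "a black",
        "his", "her", "pleasant", "the fast", "blue"]

# Illustrative stop-word list (an assumption); extend as needed.
STOP_WORDS = ["a", "the"]


def most_common_answer(rows, stop_words):
    """Return (word, count) for the most frequent non-stop-word answer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE descript (wordID INTEGER PRIMARY KEY, wordText TEXT)"
    )
    conn.executemany(
        "INSERT INTO descript (wordText) VALUES (?)", [(r,) for r in rows]
    )
    # One "?" placeholder per stop word, so the values are bound, not inlined.
    placeholders = ", ".join("?" for _ in stop_words)
    query = f"""
        SELECT lower(wordText) AS word, count(*) AS cnt
        FROM descript
        WHERE lower(wordText) NOT IN ({placeholders})
        GROUP BY lower(wordText)
        ORDER BY cnt DESC, word   -- break ties alphabetically
        LIMIT 1
    """
    row = conn.execute(query, stop_words).fetchone()
    conn.close()
    return row


print(most_common_answer(ROWS, STOP_WORDS))
```

Note that the filter compares whole entries, so an answer such as "a black" is counted as its own value and is not reduced to "black".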

Or, if you keep them in a table:

-- Anti-join: keep only the rows that have no match in stopwords
SELECT lower(d.wordText) as word, count(*)
FROM descript d left join
     stopwords sw
     on d.wordText = sw.word
WHERE sw.word is null
GROUP BY lower(d.wordText)
ORDER BY count(*) DESC
LIMIT 1;

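MySQL also ships with its own stop-word list for full-text search; it is described in the MySQL documentation on full-text stopwords.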
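The stop-word table variant can be sketched the same way. This is an illustration under assumptions, not a MySQL session: it uses Python's `sqlite3`, a hypothetical `stopwords` table with a single lowercase `word` column, and a handful of made-up rows. Because SQLite compares text case-sensitively by default, the join lowercases `wordText` explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE descript (wordID INTEGER PRIMARY KEY, wordText TEXT);
    CREATE TABLE stopwords (word TEXT PRIMARY KEY);  -- hypothetical table
""")
conn.executemany(
    "INSERT INTO descript (wordText) VALUES (?)",
    [(w,) for w in ["The", "the", "the", "blue", "Blue", "red"]],
)
conn.executemany("INSERT INTO stopwords (word) VALUES (?)", [("a",), ("the",)])

# LEFT JOIN + IS NULL keeps only the rows that found no partner in stopwords.
ANTI_JOIN = """
    SELECT lower(d.wordText) AS word, count(*) AS cnt
    FROM descript d
    LEFT JOIN stopwords sw ON lower(d.wordText) = sw.word
    WHERE sw.word IS NULL
    GROUP BY lower(d.wordText)
    ORDER BY cnt DESC, word
"""
ranking = conn.execute(ANTI_JOIN).fetchall()
print(ranking)
```

Although "the" is the most frequent raw value in this toy data, it never reaches the ranking because every "the" row finds a match in `stopwords` and is dropped by the `IS NULL` test.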

Answer 1: (score: 0)

If you run

SELECT LOWER(wordText) AS word, COUNT(*) AS cnt FROM descript GROUP BY LOWER(wordText);

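you should see how many times each word occurs.

You can add an

ORDER BY

clause, such as ORDER BY COUNT(*) DESC, to sort the results by their counts.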

Answer 2: (score: 0)

To get the most frequent value, you can use this query.

   SELECT wordText, count(*) FROM descript GROUP BY wordText  ORDER BY count(*) DESC LIMIT 1;
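A GROUP BY on the whole column value treats multi-word answers such as "a black" or "the fast" from the sample data as separate values, so the stop word inside them is never removed. One possible workaround, sketched here in Python as an assumption rather than anything the answers propose, is to split each answer into words on the client side and count those. The function name `word_counts`, the regular expression, and the stop-word set are all illustrative.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the"}  # illustrative list, not from the question


def word_counts(answers, stop_words=STOP_WORDS):
    """Count individual lowercase words across all answers, skipping stop words."""
    counts = Counter()
    for answer in answers:
        # Lowercase first, then pull out runs of letters/apostrophes as words.
        for word in re.findall(r"[a-z']+", answer.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


# In practice the answers would come from: SELECT wordText FROM descript
answers = ["Big", "blue", "a black", "Black", "the fast", "fast", "blue"]
counts = word_counts(answers)
print(counts.most_common(3))
```

With this approach "a black" contributes to the count for "black", and "the fast" to the count for "fast", while "a" and "the" themselves are dropped.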