Get the number of occurrences of specific words in a specific column of a table

Time: 2011-12-21 17:02:50

Tags: mysql

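I have about 200 words, and I want to see how many times each of them appears in a table column.

For example, suppose we have a table test with a column sentence that contains two rows:

  1. How are you? It's been long since I met you.
  2. I am fine, how are you.

Now I want to find the occurrences of the words "you" and "how". The output should look like this:

    word          count
    you            3
    how            2

That is because "you" occurs 3 times and "how" occurs 2 times across the two rows.

How can I do this?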

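4 answers:

Answer 0 (score: 0)

You can do it like this:

  1. Split the phrases and put all the items in a separate table;
  2. Remove all punctuation;
  3. Write a SELECT that joins the created table with the words you want to identify.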
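
The three steps above can be modelled outside the database to check the expected numbers. The following is only a minimal Python sketch of the idea, not MySQL code; the function name count_words and the in-memory list rows are assumptions made for illustration.

```python
import string
from collections import Counter


def count_words(sentences, targets):
    """Count how often each target word occurs across all sentences."""
    # Step 2: map every ASCII punctuation character to a space.
    strip_punct = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = Counter()
    for sentence in sentences:
        # Step 1: split the cleaned, lower-cased phrase into single words.
        counts.update(sentence.lower().translate(strip_punct).split())
    # Step 3: look up only the words we want to identify.
    return {word: counts[word.lower()] for word in targets}


# Hypothetical stand-in for the two rows of the question's table.
rows = ["How are you? It's been long since I met you.", "I am fine, how are you."]
print(count_words(rows, ["you", "how"]))
```

For the two rows from the question this prints {'you': 3, 'how': 2}, which matches the output the question asks for.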

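Answer 1 (score: 0)

The way I would approach this is to write a small user-defined function that gives me the number of times one string appears in another, with some allowances for:

  • upper and lower case
  • common punctuation

Then I would create a table with all the words I want to search for, i.e. your list of 200. Then use the function to count the occurrences of each word in each phrase, put that in an inline view, and sum the results by search word.

Hence:

The user-defined function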

DELIMITER $$

-- Counts how many delimiter-separated tokens of `phrase` are exactly `word`,
-- ignoring case and a small set of punctuation characters.
CREATE FUNCTION `get_word_count`(phrase VARCHAR(500),word VARCHAR(255), delimiter VARCHAR(1)) RETURNS INT
DETERMINISTIC
NO SQL
BEGIN
 DECLARE cur_position INT DEFAULT 1;
 DECLARE remainder TEXT;
 DECLARE cur_string VARCHAR(502);
 DECLARE delimiter_length TINYINT UNSIGNED;
 DECLARE result INT DEFAULT 0;
 DECLARE string2 VARCHAR(257);

 -- Treat common punctuation as whitespace
 SET remainder = replace(phrase,'!',' ');
 SET remainder = replace(remainder,'.',' ');
 SET remainder = replace(remainder,',',' ');
 SET remainder = replace(remainder,'?',' ');
 SET remainder = replace(remainder,':',' ');
 SET remainder = replace(remainder,'(',' ');
 SET remainder = replace(remainder,')',' ');

 -- Lower-case both sides so the comparison ignores case
 SET remainder = lower(remainder);

 -- Wrap the search word in delimiters so that only whole tokens can match
 SET string2 = concat(delimiter,lower(trim(word)),delimiter);
 SET delimiter_length = CHAR_LENGTH(delimiter);

 WHILE CHAR_LENGTH(remainder) > 0 AND cur_position > 0 DO
    SET cur_position = INSTR(remainder, delimiter);
    IF cur_position = 0 THEN
        -- Last token: wrap it too, otherwise 'to' would match inside ' too '
        SET cur_string = concat(delimiter,remainder,delimiter);
    ELSE
        SET cur_string = concat(delimiter,LEFT(remainder, cur_position - 1),delimiter);
    END IF;
    -- Skip empty tokens produced by consecutive delimiters
    IF TRIM(cur_string) != '' THEN
        SET result = result + (cur_string = string2);
    END IF;
    SET remainder = SUBSTRING(remainder, cur_position + delimiter_length);
 END WHILE;

 RETURN result;
END$$

DELIMITER ;

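You may need to play with this function a bit, depending on what allowances you need to make for punctuation and case. Hopefully you get the idea here, though!

Populating the tables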

create table search_word
(id int unsigned primary key auto_increment,
 word varchar(250) not null
);

insert into search_word (word) values ('you');
insert into search_word (word) values ('how');
insert into search_word (word) values ('to');
insert into search_word (word) values ('too');
insert into search_word (word) values ('the');
insert into search_word (word) values ('and');
insert into search_word (word) values ('world');
insert into search_word (word) values ('hello');

create table phrase_to_search
(id int unsigned primary key auto_increment,
phrase varchar(500) not null
);

insert into phrase_to_search (phrase) values ("How are you. It's been long since I met you");
insert into phrase_to_search (phrase) values ("I am fine how are you?");
insert into phrase_to_search (phrase) values ("Oh. Not bad. All is ok with the world, I think");
insert into phrase_to_search (phrase) values ("I think so too!");
insert into phrase_to_search (phrase) values ("You know what? I think so too!");

Running the query

-- Pair every search word with every phrase, count per pair, then sum per word
select word, sum(word_count) as total_word_count
from
(
select phrase, word, get_word_count(phrase, word, ' ') as word_count
from search_word
cross join phrase_to_search
) t
group by word
order by total_word_count desc;
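
To sanity-check what this query should return for the sample data, the same logic can be modelled in a few lines of Python. This is only a sketch under assumptions: the Python get_word_count below imitates the intended whole-word matching of the SQL function rather than reproducing it statement by statement, and the two lists stand in for the search_word and phrase_to_search tables.

```python
import re

# Stand-ins for the rows inserted into search_word and phrase_to_search.
search_words = ["you", "how", "to", "too", "the", "and", "world", "hello"]
phrases = [
    "How are you. It's been long since I met you",
    "I am fine how are you?",
    "Oh. Not bad. All is ok with the world, I think",
    "I think so too!",
    "You know what? I think so too!",
]


def get_word_count(phrase, word, delimiter=" "):
    """Count tokens of phrase that equal word, ignoring case and punctuation."""
    # Same punctuation set as the SQL function: each mark becomes a space.
    cleaned = re.sub(r"[!.,?:()]", " ", phrase).lower()
    # Drop empty tokens produced by consecutive delimiters.
    tokens = [token for token in cleaned.split(delimiter) if token.strip()]
    target = word.strip().lower()
    return sum(1 for token in tokens if token == target)


# Cross join of words and phrases, summed per word (the GROUP BY word step).
totals = {w: sum(get_word_count(p, w) for p in phrases) for w in search_words}

# ORDER BY total_word_count DESC
for word, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(word, total)
```

With the five sample phrases the totals come out as you 4, how 2, too 2, the 1, world 1, and 0 for to, and, and hello. Note that "to" stays at 0 even though "too" occurs twice, which is the point of matching whole tokens.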

Answer 2 (score: 0)

Here is a solution:

SELECT SUM(total_count) AS total, value
FROM (
    -- Count each raw token, then strip ?, . and ! so that "you?" and "you" share one value
    SELECT COUNT(*) AS total_count,
           REPLACE(REPLACE(REPLACE(x.value,'?',''),'.',''),'!','') AS value
    FROM (
        -- Extract the n-th space-separated word of every sentence
        SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.sentence, ' ', n.n), ' ', -1) value
        FROM table_name t CROSS JOIN
        (
            -- Numbers 1..100, so at most the first 100 words of a sentence are read
            SELECT a.N + b.N * 10 + 1 n
            FROM
             (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
            ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
        ) n
        -- Only ask for as many words as the sentence has (number of spaces + 1)
        WHERE n.n <= 1 + (LENGTH(t.sentence) - LENGTH(REPLACE(t.sentence, ' ', '')))
    ) AS x
    GROUP BY x.value
) AS y
GROUP BY value;

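Here is the full working fiddle: http://sqlfiddle.com/#!2/17481a/1

We start from @peterm's query, which extracts all the words (follow his instructions if you want to change the total number of words that are processed). We turn it into a subquery and apply COUNT with GROUP BY to the value of each word. Finally we run another query on top of that GROUP BY, because the grouped words may still carry punctuation signs, i.e. hello and hello! should count as the same word. REPLACE is used to strip those signs.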
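
The word extraction relies on nesting SUBSTRING_INDEX twice, which is easy to misread. The Python sketch below imitates that step so it can be traced by hand. It is an illustration under assumptions: substring_index is a rough stand-in for the MySQL function, max_words mirrors the 1..100 numbers table, and the lower() call approximates a case-insensitive column collation.

```python
from collections import Counter


def substring_index(text, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX(text, delim, count)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])  # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])  # everything right of it, counted from the end
    return ""


def nth_word(sentence, n):
    # Keep the first n words, then keep only the last word of that prefix.
    return substring_index(substring_index(sentence, " ", n), " ", -1)


def word_totals(sentences, max_words=100):
    totals = Counter()
    for sentence in sentences:
        # WHERE n.n <= 1 + number of spaces, capped by the size of the numbers table
        word_count = min(sentence.count(" ") + 1, max_words)
        for n in range(1, word_count + 1):
            raw = nth_word(sentence, n)
            # The outer query: strip ?, . and ! so that "you?" and "you" merge.
            cleaned = raw.replace("?", "").replace(".", "").replace("!", "")
            totals[cleaned.lower()] += 1
    return totals


sample = ["How are you? It's been long since I met you.", "I am fine, how are you."]
totals = word_totals(sample)
print(totals["you"], totals["how"], totals["fine,"])
```

For the question's two rows this gives 3 for "you" and 2 for "how". The token "fine," keeps its comma because only ?, . and ! are stripped, so further REPLACE calls would be needed for other punctuation.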

Answer 3 (score: -1)

Here is a simple solution for when you only need to count occurrences of a few words rather than full statistics:

SELECT COUNT(*) FROM `words` WHERE `row1` LIKE '%how%';
SELECT COUNT(*) FROM `words` WHERE `row1` LIKE '%you%';
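
It is worth keeping in mind that COUNT(*) with LIKE counts matching rows, not occurrences, and that '%how%' also matches longer words such as "show". The Python sketch below contrasts the two behaviours. The helper names like_row_count and occurrence_count are hypothetical, and the case-insensitive comparison assumes a case-insensitive collation on the column.

```python
import re


def like_row_count(rows, word):
    """Imitate COUNT(*) ... LIKE '%word%': rows containing the substring at least once."""
    return sum(1 for row in rows if word.lower() in row.lower())


def occurrence_count(rows, word):
    """Count whole-word occurrences across all rows."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(row)) for row in rows)


rows = ["How are you? It's been long since I met you.", "I am fine, how are you."]
print(like_row_count(rows, "you"), occurrence_count(rows, "you"))
print(like_row_count(["Show me"], "how"), occurrence_count(["Show me"], "how"))
```

For the question's rows the row count for "you" is 2 while the occurrence count is 3, and "Show me" counts as a match for '%how%' even though it does not contain the word "how".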