Question

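I'm working with a single-table database made up of two columns: an integer wordID and a varchar word. The table is several thousand rows long and was built by programmatically reading a large amount of text, splitting it on spaces, and inserting the individual words into the database. The goal is to use this dictionary to read full-text blog posts, tweets, and other text content and score them for relevance.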

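What I want to do is compute a count for each word (which I have working on its own) as well as a "score" for each word. That is, words that appear at least a minimum number of times in the dataset get a score, and the score is the inverse of the word's frequency on a scale of 1-10. The idea is that the more often a word appears, the lower its value in my later text searches. To be useful, though, a word has to appear a minimum number of times, since a one-off is probably a typo.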
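As a rough illustration of the scoring idea above, here is a minimal Python sketch of the arithmetic. It assumes the formula 10 * ((max + 1) - count) / max that the select statement in this question attempts; the function name frequency_score and the sample counts are made up for illustration.

```python
def frequency_score(count, max_count):
    """Inverse-frequency score: close to 10 for rare words, close to 0 for the most common one.

    Assumes the formula 10 * ((max + 1) - count) / max from the question's query.
    """
    return 10 * ((max_count + 1) - count) / max_count


# Hypothetical counts, with the most common word occurring 100 times.
for count in (1, 50, 100):
    print(count, frequency_score(count, 100))
```

Under this formula the most common word scores 10/max rather than exactly 1, so the resulting scale runs closer to 0-10 than 1-10.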

Here is my select statement, which tries to score word frequency while also producing the counts.

  select word, 
  count(word), 
  10*(((max(count(word))+1) - count(word))/(max(count(word))))
  from dictwords where length(word)>3 group by word having count(word)>35 
  order by count(word) desc;

The error MySQL returns is "Invalid use of group function" (error 1111).

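Is it possible to do this kind of thing in a single MySQL statement? Or should I split the count and the score into two queries, selecting my results into a placeholder table and then scoring against that?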

Answer 1

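I don't think you can do this in one query, because you're trying to find the number of times the most common word occurs (I think). This worked for me on a test dataset: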

# get the number of times the most common word occurs
select @maxCount := count(word)
from temp 
where length(word)>3 
group by word 
having count(word)>10
order by count(word) desc
limit 1;

# now use that max value to calculate a score
select 
    word, 
    count(word) as wordCount,
    @maxCount as maxWordCount,
    10*(((@maxCount+1) - count(word))/(@maxCount)) as score
from temp 
where length(word)>3 
group by word 
having wordCount>10
order by wordCount desc;

There's a sqlfiddle here if you want to check whether I got it right.
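For what it's worth, error 1111 comes from nesting one aggregate inside another (max(count(word))), which MySQL does not allow. One possible way around that in a single statement is to compute the per-word counts in a derived table and cross join them against the maximum count. The sketch below is only an illustration under stated assumptions: it runs the SQL through Python's built-in sqlite3 module as a stand-in for MySQL, the sample data is hypothetical, and the thresholds are lowered to suit the tiny sample. The SQL itself uses only derived tables and CROSS JOIN, which MySQL also supports, but it has not been run against the original dictwords table.

```python
import sqlite3

# Hypothetical sample data: word -> number of rows to insert into dictwords.
SAMPLE = {"alpha": 6, "bravo": 4, "delta": 2, "the": 9, "gamma": 1}

QUERY = """
    SELECT c.word, c.wc, 10.0 * ((m.max_wc + 1) - c.wc) / m.max_wc AS score
    FROM (SELECT word, COUNT(word) AS wc
          FROM dictwords
          WHERE LENGTH(word) > ?
          GROUP BY word
          HAVING COUNT(word) > ?) AS c
    CROSS JOIN
         (SELECT MAX(wc) AS max_wc
          FROM (SELECT COUNT(word) AS wc
                FROM dictwords
                WHERE LENGTH(word) > ?
                GROUP BY word
                HAVING COUNT(word) > ?) AS counts) AS m
    ORDER BY c.wc DESC
"""


def score_words(min_length=3, min_count=1):
    """Return (word, count, score) rows computed by one SELECT statement."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL database
    conn.execute(
        "CREATE TABLE dictwords (wordID INTEGER PRIMARY KEY, word VARCHAR(50))"
    )
    rows = [(word,) for word, n in SAMPLE.items() for _ in range(n)]
    conn.executemany("INSERT INTO dictwords (word) VALUES (?)", rows)
    # The same filters are applied in both derived tables, hence the repeated parameters.
    params = (min_length, min_count, min_length, min_count)
    result = conn.execute(QUERY, params).fetchall()
    conn.close()
    return result


for word, wc, score in score_words():
    print(word, wc, round(score, 2))
```

The cost of this shape is that the filtered count subquery appears twice; the two-step version with @maxCount avoids that repetition at the price of a second statement.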

Answer 2

  drop table if exists wordcount;

  create table wordcount(
   word varchar(50) primary key,
   wc   int     not null
  );

  insert into wordcount (word, wc)
  select word, count(word)
  from dictwords 
  where length(word)>3 
  group by word 
  having count(word)>35 
  order by count(word) desc;


  drop table if exists wordscore;

  create table wordscore(
   word  varchar(50) primary key,
   score int     not null
  );

  insert into wordscore (word, score)
  select word, (1-(10*(((max(wc)+1) - wc)/(max(wc)))))*10
  from wordcount 
  group by word;

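I had to create a table here, but I got it working. Since I'm only looking at words with more than 35 occurrences in the raw data, the scores in this result set come out between 7 and 10.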
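A note on why the range is 7-10, offered as a reading of the query rather than a statement of the author's intent: because word is the primary key of wordcount, grouping by word puts every row in its own group, so max(wc) appears to equal wc within each group. The scoring expression then reduces to 10 - 100/wc, which depends only on each word's own count. The Python sketch below just replays that arithmetic; the function name answer2_score and the sample counts are hypothetical.

```python
def answer2_score(wc):
    """Replay the wordscore expression for a single-row group."""
    # Assumption: with GROUP BY on the primary key, max(wc) equals wc in each group.
    group_max = wc
    return (1 - (10 * (((group_max + 1) - wc) / group_max))) * 10


# Hypothetical counts; 36 is the smallest count that passes HAVING count(word) > 35.
for wc in (36, 100, 1000):
    print(wc, round(answer2_score(wc), 2))
```

With a minimum count of 36 this gives about 7.2 at the low end and approaches 10 for very frequent words, which is consistent with the 7-10 range reported above. It also means frequent words score higher here, not lower.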

Scoring word frequency in a MySQL query
