Calculating term frequency in MySQL

Time: 2018-09-03 10:28:34

Tags: mysql tf-idf

My MySQL table looks like this:

DocumentID          Documents
============================================
0                   Penny bought bright blue fishes.
1                   Penny bought bright blue and orange fish.
2                   The cat ate a fish at the store.
3                   Penny went to the store. Penny ate a bug. Penn...
4                   It meowed once at the bug, it is still meowing...
5                   The cat is at the fish store. The cat is orang...
6                   Penny is a fish

Now I want to create a new table with one row per DocumentID and one column for each unique word across all documents. Each value should equal:

(number of times word appears in sentence) / (number of words in sentence)

Something like this:

DocumentID    ate       blue        bought      bright      bug         cat     fish        meow        once        orang       penni       saw         store       went
0             0.000000  0.200000    0.200000    0.200000    0.000000    0.000   0.200000    0.000000    0.000000    0.000000    0.200000    0.000000    0.000000    0.000000
1             0.000000  0.166667    0.166667    0.166667    0.000000    0.000   0.166667    0.000000    0.000000    0.166667    0.166667    0.000000    0.000000    0.000000
2             0.250000  0.000000    0.000000    0.000000    0.000000    0.250   0.250000    0.000000    0.000000    0.000000    0.000000    0.000000    0.250000    0.000000
3             0.111111  0.000000    0.000000    0.000000    0.111111    0.000   0.111111    0.000000    0.000000    0.000000    0.333333    0.111111    0.111111    0.111111
4             0.000000  0.000000    0.000000    0.000000    0.333333    0.000   0.166667    0.333333    0.166667    0.000000    0.000000    0.000000    0.000000    0.000000
5             0.000000  0.000000    0.000000    0.000000    0.000000    0.375   0.250000    0.125000    0.000000    0.125000    0.000000    0.000000    0.125000    0.000000
6             0.000000  0.000000    0.000000    0.000000    0.000000    0.000   0.500000    0.000000    0.000000    0.000000    0.500000    0.000000    0.000000    0.000000
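
For reference, the calculation behind this table can be sketched outside the database. The following is a minimal Python sketch, not a MySQL solution. It uses only the sample rows that appear in full above (rows 3 to 5 are truncated in the question, so they are left out). The stop-word list and the names `STOP_WORDS`, `term_frequencies`, and `tf_matrix` are assumptions made for illustration. The expected values above appear to be consistent with dropping common words such as "the", "a", and "and" before counting.

```python
import re
from collections import Counter

# Only the sample rows shown in full in the question; rows 3-5 are truncated there.
DOCUMENTS = {
    0: "Penny bought bright blue fishes.",
    1: "Penny bought bright blue and orange fish.",
    2: "The cat ate a fish at the store.",
    6: "Penny is a fish",
}

# Assumed stop-word list (hypothetical, not given in the question).
STOP_WORDS = {"a", "and", "at", "is", "the"}


def term_frequencies(text):
    """Return {word: count / total_words} for one document."""
    # Lowercase, keep alphabetic runs only, and drop the assumed stop words.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def tf_matrix(documents):
    """Pivot per-document frequencies into (sorted vocabulary, {doc_id: row})."""
    per_doc = {doc_id: term_frequencies(text) for doc_id, text in documents.items()}
    vocabulary = sorted({word for tf in per_doc.values() for word in tf})
    # Words missing from a document get a frequency of 0.0 in that row.
    rows = {
        doc_id: [tf.get(word, 0.0) for word in vocabulary]
        for doc_id, tf in per_doc.items()
    }
    return vocabulary, rows


if __name__ == "__main__":
    vocabulary, rows = tf_matrix(DOCUMENTS)
    print("DocumentID", *vocabulary)
    for doc_id, row in rows.items():
        print(doc_id, *(f"{value:.6f}" for value in row))
```

This sketch does no stemming, so "fishes" and "fish" stay in separate columns. The expected table merges such forms under stems like "fish", "penni", and "orang", which would need an additional stemming step.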

I have tried many approaches, but none of them gave the expected result.

0 answers:

No answers