Finding similar documents using KL divergence (Python 3.x)

Time: 2018-11-02 04:12:19

Tags: python dictionary scikit-learn text-mining entropy

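I have two dictionaries, train_dict and test_dict. Each one maps a document ID to that document's topic distribution (topic ID to probability). For each document in test_dict, I want to find the documents in train_dict whose topic distributions are similar or closest. I read that KL divergence is a technique used for this purpose, but I am not sure how to apply it in this situation.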

train_dict =  {490514.0: {0: 0.039169986,
  1: 0.023344912,
  2: 0.028936442,
  3: 0.022125904,
  4: 0.040051,
  5: 0.030525777,
  6: 0.06751838,
  7: 0.59827864,
  8: 0.023744604,
  9: 0.04026981,
  10: 0.044118173,
  11: 0.041916344},
 489733.0: {0: 0.012707975,
  1: 0.5981753,
  4: 0.012993803,
  6: 0.021207014,
  7: 0.010705788,
  9: 0.07442666,
  10: 0.22201125,
  11: 0.01359898},
 497410.0: {0: 0.012707975,
  1: 0.5981752,
  4: 0.012993803,
  6: 0.021207014,
  7: 0.010705788,
  9: 0.07442666,
  10: 0.22201134,
  11: 0.01359898}}

test_dict =  {85.0: {0: 0.28180935978889465,
  1: 0.02879604697227478,
  2: 0.0356932207942009,
  3: 0.027292393147945404,
  4: 0.2815341353416443,
  5: 0.03765367344021797,
  6: 0.08200311660766602,
  7: 0.04070392623543739,
  8: 0.029300140216946602,
  9: 0.04947005212306976,
  10: 0.05403999984264374,
  11: 0.051703985780477524},
 86.0: {0: 0.28180935978889465,
  1: 0.028796043246984482,
  2: 0.0356932170689106,
  3: 0.027292391285300255,
  4: 0.2815358638763428,
  5: 0.03765366971492767,
  6: 0.08200132846832275,
  7: 0.040703922510147095,
  8: 0.02930011972784996,
  9: 0.049470048397779465,
  10: 0.05403999239206314,
  11: 0.05170397832989693}}

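I want to compute the Kullback-Leibler (KL) divergence between the entries of train_dict and test_dict. For each document in test_dict, I want to find the 2 closest points among the train_dict values. I am not sure how to compute this.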
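One possible way to approach this is sketched below. It is not a definitive answer and rests on several assumptions. Some train_dict entries omit topics (for example 2, 3, 5 and 8), so missing topics are treated as probability 0, a small constant `eps` is added to every topic so the logarithm stays finite, and each vector is renormalized to sum to 1. KL divergence is asymmetric, and the direction D(test || train) used here is an assumption. The helper names `to_vector`, `kl_divergence` and `closest_train_docs` are hypothetical and chosen only for illustration.

```python
import math


def to_vector(topic_probs, n_topics, eps=1e-10):
    """Turn a sparse {topic_id: prob} dict into a dense, renormalized list.

    Missing topics count as 0; eps (an assumed smoothing constant) is
    added to every entry so that no probability is exactly zero.
    """
    vec = [topic_probs.get(k, 0.0) + eps for k in range(n_topics)]
    total = sum(vec)
    return [v / total for v in vec]


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), using the natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def closest_train_docs(train_dict, test_dict, k=2, eps=1e-10):
    """For each test document, return the k train documents with the
    smallest D_KL(test || train), as a list of (train_id, divergence)."""
    # Number of topics = largest topic id seen in either dict, plus one.
    n_topics = 1 + max(
        topic
        for docs in (train_dict, test_dict)
        for doc in docs.values()
        for topic in doc
    )
    train_vecs = {
        doc_id: to_vector(probs, n_topics, eps)
        for doc_id, probs in train_dict.items()
    }
    result = {}
    for test_id, probs in test_dict.items():
        p = to_vector(probs, n_topics, eps)
        scored = sorted(
            (kl_divergence(p, q), train_id) for train_id, q in train_vecs.items()
        )
        result[test_id] = [(train_id, dist) for dist, train_id in scored[:k]]
    return result
```

Calling `closest_train_docs(train_dict, test_dict, k=2)` with the dictionaries from the question would return, for each test document ID, the two train document IDs with the smallest divergence. If SciPy is available, `scipy.stats.entropy(p, q)` computes the same quantity for two dense vectors. If a symmetric measure is preferred, the Jensen-Shannon divergence is a common alternative.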

0 answers:

No answers