Question

I have two files. Doc1 has the following format:

TOPIC:  0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094

TOPIC:  1 12366.0
web 0.150331554262
site 0.0517548115801
say 0.0451237263464
Internet 0.0153647096879
online 0.0135856380398

...and so on, in the same pattern, up to topic 99.

Doc2 has the format:

0 0.566667 0 0.0333333 0 0 0 0.133333 ..........

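and so on... There are 100 values in total, one for each topic.

Now I have to find the weighted average probability of each word, i.e.:

P(w) = alpha_1*P(w_1) + alpha_2*P(w_2) + ... + alpha_n*P(w_n)

where alpha_n is the value in the nth position of Doc2, corresponding to the nth topic, and P(w_n) is the word's probability in the nth topic.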

For the word "say", the probability should be

P(say) = 0*0.0159 + 0.5666*0.045+.......

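Similarly, I have to calculate the probability for every word.

For the multiplication, if the word is taken from topic 0, then the 0th value from Doc2 must be used, and so on.

So far I have only counted word occurrences with the code below and have never used their values, so I am confused.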

 with open(doc2, "r") as f:
    with open(doc3, "w") as f1:

         words = " ".join(line.strip() for line in f)
         d = defaultdict(int)
         for word in words.split():  
              d[word] += 1
              for key, value in d.iteritems() :
                  f1.write(key+ ' ' + str(value) + ' ')
              print '\n'

My output should look like this:

 say = "prob of this word calculated by above formula"
 site = "
 internet = "

and so on.

What am I doing wrong?

Answer 1

Assuming you ignore the TOPIC lines, group the values with a defaultdict and then do the calculation at the end:

from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # one weight per topic, in topic order
    values = [float(x) for x in f2.read().split()]
    for line in f:
        # skip blank lines and the TOPIC headers
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

# pair each word's i-th probability with the i-th topic weight
for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))
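The zip above pairs a word's i-th stored probability with the i-th weight, so it relies on every word appearing in every topic. Here is a minimal sketch, assuming some words may be missing from some topics, that stores the topic index next to each probability. The helper name `weighted_probs` and the choice to pass lines and weights as arguments are mine, for illustration only, and topics are numbered by the order of their TOPIC headers rather than by parsing the number on the header line:

```python
from collections import defaultdict


def weighted_probs(doc1_lines, weights):
    """Return {word: sum over topics of weights[topic] * P(word | topic)}.

    doc1_lines: iterable of lines in the doc1 format from the question.
    weights: list of floats, one per topic, in topic order (doc2).
    """
    pairs = defaultdict(list)   # word -> [(topic index, probability), ...]
    topic = -1                  # index of the section currently being read
    for line in doc1_lines:
        if line.startswith("TOPIC"):
            topic += 1          # a TOPIC header opens the next section
        elif line.strip():      # skip blank lines
            name, val = line.split()
            pairs[name].append((topic, float(val)))
    # look each weight up by the stored topic index instead of by position
    return {name: sum(weights[t] * p for t, p in tp)
            for name, tp in pairs.items()}
```

With the files from the question it could be called along the lines of `weighted_probs(open("doc1"), [float(x) for x in open("doc2").read().split()])`.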

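Another way is to do the calculation as you go, increasing a counter every time you hit a new section, i.e. a line starting with TOPIC, and using it as the index to get the right value from values: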

from collections import defaultdict

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    # create list of all floats from doc2
    values = [float(x) for x in f2.read().split()]
    ind = -1
    for line in f:
        # each new TOPIC line moves ind to the matching index in values
        if line.startswith("TOPIC"):
            ind += 1
            continue
        # ignore empty lines
        if line.strip():
            # get word and float and multiply val by the matching weight
            name, val = line.split()
            d[name] += float(val) * values[ind]

for k, v in d.items():
    print("Prob for {} is {}".format(k, v))

Using the two doc1 sections shown in the question and 0 0.566667 0 0.0333333 0 in doc2 outputs the following:

Prob for web is 0.085187930859
Prob for say is 0.0255701266375
Prob for online is 0.0076985327511
Prob for site is 0.0293277438137
Prob for Internet is 0.00870667394471
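As a sanity check, two of these numbers can be recomputed by hand from the formula in the question. The sketch below hard-codes the probabilities of "say" and "web" from the doc1 excerpt and only the first two weights, since only two topics are present in the sample; the variable names are my own:

```python
# Topic weights for topics 0 and 1 (the first two values of the sample doc2).
weights = [0.0, 0.566667]

# P(word | topic) for topics 0 and 1, copied from the doc1 excerpt.
topic_probs = {
    "say": [0.0159538357094, 0.0451237263464],
    "web": [0.0218940936864, 0.150331554262],
}

# Weighted sum per word: sum over topics of weight * probability.
result = {
    word: sum(w * p for w, p in zip(weights, probs))
    for word, probs in topic_probs.items()
}

for word, prob in result.items():
    print("Prob for {} is {:.10f}".format(word, prob))
```

Topic 0 has weight 0, so each result is just 0.566667 times the word's topic 1 probability, which agrees with the values listed above up to floating-point rounding.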

You can also use itertools groupby:

from collections import defaultdict
from itertools import groupby, imap

d = defaultdict(float)

with open("doc1") as f, open("doc2") as f2:
    values = imap(float, f2.read().split())
    # lambda x: not x.strip() splits the lines into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            # the first line of each group is the TOPIC header; skip it
            next(v)
            # get matching float from values
            weight = next(values)
            # iterate over the rest of the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * weight
for k, v in d.items():
    print("Prob for {} is {}".format(k, v))

For Python 3, every itertools imap should be changed to the built-in map, which also returns an iterator in Python 3.
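A sketch of how the groupby version might look on Python 3, with imap replaced by map. It is written as a function over any iterable of lines so it can be tried without files; `probs_by_group` is a name I made up, and the code assumes topics are separated by blank lines and that doc2 has at least one weight per topic:

```python
from collections import defaultdict
from itertools import groupby


def probs_by_group(doc1_lines, doc2_text):
    totals = defaultdict(float)
    # map() is lazy on Python 3, so next() pulls one weight per topic
    weights = map(float, doc2_text.split())
    # blank lines get key True, content lines get key False
    for is_blank, group in groupby(doc1_lines, key=lambda s: not s.strip()):
        if is_blank:
            continue
        next(group)              # skip the TOPIC header line
        weight = next(weights)   # weight for this topic
        for s in group:
            name, val = s.split()
            totals[name] += float(val) * weight
    return dict(totals)
```

If doc2 holds fewer weights than there are topic groups, `next(weights)` raises StopIteration, so a real script may want to check the lengths first.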
