Python: co-occurrence matrix of words and phrases

Asked: 2016-03-15 02:58:31

Tags: python python-2.7 numpy pandas matrix

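I am working with two text files. One contains a list of 58 words (L1) and the other contains 1173 phrases (L2). For every pair of indices i and j in range(len(L1)), I want to check whether L1[i] and L1[j] co-occur in the phrases of L2.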

For example:

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

for i in range(len(L1)):
    for j in range(len(L1)):
        for s in range(len(L2)):
            if L1[i] in L2[s] and L1[j] in L2[s]:
                output = L1[i], L1[j], L2[s]
                print output

Output (the entries for 'be your self' from L2):

('b', 'b', 'be your self')
('b', 'e', 'be your self')
('b', 'y', 'be your self')
('e', 'b', 'be your self')
('e', 'e', 'be your self')
('e', 'y', 'be your self')
('y', 'b', 'be your self')
('y', 'e', 'be your self')
('y', 'y', 'be your self')

The output shows what I want, but in order to visualize the data I also need the number of times L1[i] co-occurs with L1[j].

For example:

  b e y
b 1 1 1
e 1 2 1
y 1 1 1

Should I use pandas or numpy to produce this result?

I found this question about co-occurrence matrices, but I could not find a concrete answer in it: efficient algorithm for finding co occurrence matrix of phrases

Thanks!

3 answers:

Answer 0 (score: 3)

Here is a solution that uses itertools.product. It should perform noticeably better than the accepted solution (if performance is a concern).

from functools import reduce  # built-in in Python 2, must be imported in Python 3
from itertools import product
from operator import mul

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

phrase_map = {}

for phrase in L2:
    word_count = {word: phrase.count(word) for word in L1 if word in phrase}

    occurrence_map = {}
    for perm in product(word_count, repeat=2):
        occurrence_map[perm] = reduce(mul, (word_count[key] for key in perm), 1)

    phrase_map[phrase] = occurrence_map

In my timings, this is 2-4 times faster in Python 3 (the improvement is probably smaller in Python 2). Also, in Python 3 you need to import reduce from functools.

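Edit: Note that while this implementation is relatively simple, it is clearly inefficient. For example, we know that the output will be symmetric, and this solution does not take advantage of that. Using combinations_with_replacement instead of product generates only the entries in the upper triangular part of the output matrix. We can therefore improve the solution above as follows: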

from itertools import combinations_with_replacement

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

phrase_map = {}

for phrase in L2:
    word_count = {word: phrase.count(word) for word in L1 if word in phrase}

    occurrence_map = {}
    for x, y in combinations_with_replacement(word_count, 2):
        occurrence_map[(x,y)] = occurrence_map[(y,x)] = \
            word_count[x] * word_count[y]

    phrase_map[phrase] = occurrence_map

print(phrase_map)

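As expected, this version takes about half the time of the previous one. Note that it relies on being restricted to pairs of two elements, whereas the previous version was not.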
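To illustrate the remark that the product-based version is not tied to pairs, here is a minimal sketch that reuses the same counting idea for triples. The name triple_map and the choice of repeat=3 are assumptions made for this illustration; they are not part of the answer.

```python
from functools import reduce
from itertools import product
from operator import mul

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
phrase = 'be your self'

# Same per-phrase count as in the answer: how often each word of L1 occurs.
word_count = {word: phrase.count(word) for word in L1 if word in phrase}

# repeat=3 enumerates ordered triples; multiplying the counts with
# reduce(mul, ...) works for tuples of any length, not only pairs.
triple_map = {
    combo: reduce(mul, (word_count[key] for key in combo), 1)
    for combo in product(word_count, repeat=3)
}

print(triple_map[('b', 'e', 'e')])
```

The combinations_with_replacement version would need its hard-coded x, y unpacking and its mirrored assignment reworked before it could do the same.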

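Note that about 15-20% of the running time can be shaved off if the line

occurrence_map[(x,y)] = occurrence_map[(y,x)] = ...

is changed to

occurrence_map[(x,y)] = ...

but this may be less than ideal, depending on how you intend to use this mapping later.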
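If only one orientation of each pair is stored, every later lookup has to agree on a key order. One possible way to handle that, sketched under the assumption that keys are stored in sorted order (the names half_map and lookup are hypothetical and do not appear in the answer):

```python
from itertools import combinations_with_replacement

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
phrase = 'be your self'

word_count = {word: phrase.count(word) for word in L1 if word in phrase}

# Store each unordered pair once, under a canonical (sorted) key.
half_map = {}
for x, y in combinations_with_replacement(sorted(word_count), 2):
    half_map[(x, y)] = word_count[x] * word_count[y]


def lookup(a, b):
    """Hypothetical helper: order-independent lookup, 0 if the pair is absent."""
    return half_map.get(tuple(sorted((a, b))), 0)


print(lookup('y', 'e'))
print(lookup('e', 'y'))
```

Both calls return the same value because the helper sorts the key before looking it up. The trade-off is a small cost on every read in exchange for storing roughly half as many entries.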

Answer 1 (score: 1)

OK, why don't you try this?

from collections import defaultdict

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day', 'yes be your self']

d = dict.fromkeys(L2)

for s, phrase in enumerate(L2):
    d[phrase] = defaultdict(int)
    for letter1 in phrase:
        for letter2 in phrase:
            if letter1 in L1 and letter2 in L1:
                output = letter1, letter2, phrase
                print output
                key = (letter1, letter2)
                d[phrase][key] += 1

print d

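To capture repeated values, you need to iterate over the phrase rather than over the list L1, and then check whether each letter of the phrase is in L1 (in other words, swap the operands of the in expression).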

Output:

{
'x men': defaultdict(<type 'int'>, {('e', 'e'): 1, ('e', 'x'): 1, ('x', 'x'): 1, ('x', 'e'): 1}),
'great zoo': defaultdict(<type 'int'>, {('t', 't'): 1, ('t', 'z'): 1, ('e', 'e'): 1, ('e', 'z'): 1, ('t', 'e'): 1, ('z', 'e'): 1, ('z', 't'): 1, ('e', 't'): 1, ('z', 'z'): 1}),
'the onion': defaultdict(<type 'int'>, {('e', 't'): 1, ('t', 'e'): 1, ('e', 'e'): 1, ('t', 't'): 1}),
'be your self': defaultdict(<type 'int'>, {('b', 'y'): 1, ('b', 'b'): 1, ('e', 'e'): 4, ('y', 'e'): 2, ('y', 'b'): 1, ('y', 'y'): 1, ('e', 'b'): 2, ('e', 'y'): 2, ('b', 'e'): 2}),
'corn day': defaultdict(<type 'int'>, {('d', 'd'): 1, ('y', 'd'): 1, ('d', 'y'): 1, ('y', 'y'): 1, ('y', 'c'): 1, ('c', 'c'): 1, ('c', 'y'): 1, ('c', 'd'): 1, ('d', 'c'): 1}),
'yes be your self': defaultdict(<type 'int'>, {('b', 'y'): 2, ('b', 'b'): 1, ('e', 'e'): 9, ('y', 'e'): 6, ('y', 'b'): 2, ('y', 'y'): 4, ('e', 'b'): 3, ('e', 'y'): 6, ('b', 'e'): 3})
}
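The per-phrase dictionaries above can also be summed into a single matrix, which is closer to the table the question asks for. The following is only a sketch: it assumes that repeated letters should be multiplied the way this answer counts them, and the names index and matrix are chosen for the illustration.

```python
from collections import Counter

import numpy as np

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

# Map each entry of L1 to its row/column position in the matrix.
index = {word: i for i, word in enumerate(L1)}
matrix = np.zeros((len(L1), len(L1)), dtype=int)

for phrase in L2:
    # Iterate over the phrase (not over L1) so repeated letters are counted.
    counts = Counter(ch for ch in phrase if ch in index)
    for a, n_a in counts.items():
        for b, n_b in counts.items():
            matrix[index[a], index[b]] += n_a * n_b

print(matrix)
```

If row and column labels are wanted, the array can be wrapped as pandas.DataFrame(matrix, index=L1, columns=L1); NumPy alone is enough for the counting itself.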

Answer 2 (score: 1)

You can try the code below.

import collections
import numpy as np

tokens = ['He', 'is', 'not', 'lazy', 'intelligent', 'smart']
j = 0
a = np.zeros((len(tokens), len(tokens)))
for pos, token in enumerate(tokens):
    j += pos + 1  # start at the column just right of the diagonal
    for token1 in tokens[pos + 1:]:
        count = 0
        for sentence in [['He', 'is', 'not', 'lazy', 'He', 'is', 'intelligent', 'He', 'is', 'smart']]:
            # positions of each token in the sentence
            occurrences1 = [i for i, e in enumerate(sentence) if e == token1]
            occurrences2 = [i for i, e in enumerate(sentence) if e == token]
            # absolute distance between every pair of positions
            new1 = np.repeat(occurrences1, len(occurrences2))
            new2 = np.asarray(occurrences2 * len(occurrences1))
            final_abs_diff = np.absolute(np.subtract(new1, new2))
            final_counts = collections.Counter(final_abs_diff)
            # count the pairs that are at most two positions apart
            count += final_counts[0] + final_counts[1] + final_counts[2]
        a[pos][j] = count
        j += 1
    j = 0

# mirror the upper triangle to obtain a symmetric matrix
final_mat = a.T + a
print(final_mat)

The output is:

[[0. 4. 2. 1. 2. 1.]
 [4. 0. 1. 2. 2. 1.]
 [2. 1. 0. 1. 0. 0.]
 [1. 2. 1. 0. 0. 0.]
 [2. 2. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0.]]