Matrix creation

Time: 2015-03-24 01:19:18

Tags: python matlab matrix

I have many documents, each containing a list of pairs like the one shown below.

[('ADVP', 'RB'), ('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'NN'), ('PP', 'TO'),
 ('NP', 'PRP'), ('NP', 'RB'), ('NP', 'CD'), ('NP', 'JJ'), ('NP', 'NN'), ('PP', 'IN'),
 ('NP', 'NNS'), ('ADVP', 'RB'), ('NP', 'PRP'), ('PP', 'IN'), ('NP', 'DT'), ('NP', 'NN'),
 ('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'), ('ADVP', 'RB'), ('NP', 'DT'), ('NP', 'JJ'),
 ('NP', 'NN'), ('WHNP', 'WDT'), ('NP', 'JJS'), ('NP', 'CD'), ('NP', 'PRP'), ('VP', 'VBP'),
 ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'PRP'), ('VP', 'VBD'), ('NP', 'DT'), ('NP', 'NN'),
 ('WHADVP', 'WRB'), ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'RB'), ('NP', 'DT'), ('NP', 'NNS'),
 ('PRT', 'RP'), ('NP', 'PRP'), ('ADVP', 'RB'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'PRP'),
 ('PP', 'IN'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NN')]

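I want to create a matrix in Excel in which each unique syntactic category pair, such as ('ADVP', 'RB'), ('NP', 'NN'), ('NP', 'DT'), acts as a column header for its frequency.

Secondly, a later document (the third one, say) may contain category pairs that did not occur in the earlier documents. Those previously unseen syntactic pairs must be appended to the column headers.

Finally, I want to end up with a matrix in which the columns correspond to syntactic pairs and the rows correspond to the different documents. Each entry Mij of the matrix should give the frequency of the j-th syntactic pair in the i-th document.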

1 Answer:

Answer 0: (score: 0)

You can use the collections module to count the frequency of each pair:

import collections
doc1 = [('ADVP', 'RB'), ('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'NN'),   ('PP','TO'), ('NP', 'PRP'), ('NP', 'RB'), ('NP', 'CD'), ('NP', 'JJ'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NNS'), ('ADVP', 'RB'), ('NP', 'PRP'), ('PP', 'IN'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'), ('ADVP', 'RB'), ('NP', 'DT'), ('NP', 'JJ'), ('NP', 'NN'), ('WHNP', 'WDT'), ('NP', 'JJS'), ('NP', 'CD'), ('NP', 'PRP'), ('VP', 'VBP'), ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'PRP'),('VP', 'VBD'), ('NP', 'DT'), ('NP', 'NN'), ('WHADVP', 'WRB'), ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'RB'), ('NP', 'DT'), ('NP', 'NNS'), ('PRT', 'RP'), ('NP', 'PRP'), ('ADVP', 'RB'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'PRP'), ('PP', 'IN'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NN')]
count1 = collections.Counter(doc1)

This gives you

count1.keys()
>>>[('PP', 'IN'), ('WHADVP', 'WRB'), ('NP', 'NNS'), ('WHNP', 'WDT'), ('NP', 'NN'), ('NP', 'JJS'), ('NP', 'DT'), ('NP', 'CD'), ('ADVP', 'RB'), ('PRT', 'RP'), ('VP', 'VBD'), ('NP', 'JJ'), ('NP', 'RB'), ('VP', 'VBP'), ('NP', 'PRP'), ('PP', 'TO')]

count1.values()
>>>[5, 1, 4, 1, 13, 1, 9, 2, 4, 1, 1, 2, 2, 1, 6, 1]
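The order of the keys and values shown above is not something to rely on for a column layout. As a small sketch, assuming a short made-up document rather than the real data, a Counter can be read out in a predictable order with most_common() or sorted():

```python
import collections

# Hypothetical short document, used only for illustration.
doc = [('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'),
       ('PP', 'IN'), ('NP', 'NN'), ('NP', 'DT')]
count = collections.Counter(doc)

# Pairs sorted by descending frequency.
top = count.most_common(2)
print(top)      # [(('NP', 'NN'), 3), (('NP', 'DT'), 2)]

# Pairs in alphabetical order, which gives a stable column layout.
ordered = sorted(count.items())
print(ordered)  # [(('NP', 'DT'), 2), (('NP', 'NN'), 3), (('PP', 'IN'), 1)]
```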

Do this for every document.
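Counting every document can be sketched as a single list comprehension. The three tiny documents below are invented sample data, and the name docs is an assumption, not something from the question:

```python
import collections

# Hypothetical sample data: each document is a list of (phrase, tag) pairs.
docs = [
    [('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN')],
    [('PP', 'IN'), ('NP', 'NN')],
    [('VP', 'VBD'), ('PP', 'IN'), ('PP', 'IN')],
]

# One Counter per document, kept in document order.
counts = [collections.Counter(doc) for doc in docs]

# A Counter returns 0 for a pair it has never seen,
# so a pair missing from a document needs no special handling.
print(counts[0][('NP', 'NN')])  # 2
print(counts[0][('PP', 'IN')])  # 0
```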

After that, you need to turn each count into a list of three values, one slot per document. NumPy arrays are easier to work with here.

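import numpy as np

# pairs1, pairs2 and pairs3 are assumed to be the Counters of the three
# documents (pairs1 being count1 from above). Each count becomes a
# length-3 vector holding the count in that document's slot.
for key in pairs1.keys():
    pairs1[key] = np.array([pairs1[key], 0, 0])

for key in pairs2.keys():
    pairs2[key] = np.array([0, pairs2[key], 0])

for key in pairs3.keys():
    pairs3[key] = np.array([0, 0, pairs3[key]])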

Then merge all three dictionaries into one:

pairs = {}

for key in pairs1.keys():
    pairs[key] = pairs1[key]

for key in pairs2.keys():
    try:
        pairs[key] = pairs[key] + pairs2[key]
    except KeyError:
        pairs[key] = pairs2[key]

for key in pairs3.keys():
    try:
        pairs[key] = pairs[key] + pairs3[key]
    except KeyError:
        pairs[key] = pairs3[key]
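The same merge can also be sketched without NumPy and for any number of documents: collect the unique pairs in first-seen order (so that pairs first appearing in a later document are appended as new columns) and read each row straight from that document's Counter. The function name build_matrix and the sample documents are assumptions made for this illustration:

```python
import collections

def build_matrix(docs):
    """Return (columns, rows).

    columns: unique pairs in the order they are first seen.
    rows[i][j]: how often columns[j] occurs in docs[i].
    """
    counts = [collections.Counter(doc) for doc in docs]
    columns = []
    seen = set()
    for doc in docs:
        for pair in doc:
            if pair not in seen:
                seen.add(pair)
                columns.append(pair)
    # A Counter yields 0 for pairs that are absent from a document.
    rows = [[count[pair] for pair in columns] for count in counts]
    return columns, rows

# Hypothetical sample data.
docs = [
    [('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN')],
    [('PP', 'IN'), ('NP', 'NN')],
    [('VP', 'VBD'), ('PP', 'IN'), ('PP', 'IN')],
]
columns, rows = build_matrix(docs)
print(columns)  # [('NP', 'NN'), ('NP', 'DT'), ('PP', 'IN'), ('VP', 'VBD')]
print(rows)     # [[2, 1, 0, 0], [1, 0, 1, 0], [0, 0, 2, 1]]
```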

Finally, you can write out your matrix:

import csv

keys = list(pairs.keys())  # fix the column order once
with open('myfile.csv', 'w', newline='') as f:
    # csv.writer quotes cells when needed, so commas cannot break the columns
    writer = csv.writer(f)
    # header row: one column per syntactic pair, written as e.g. NP/NN
    writer.writerow(['%s/%s' % key for key in keys])
    # one row per document (three documents here)
    for i in range(3):
        writer.writerow([pairs[key][i] for key in keys])
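As a rough sketch of a more general writer, the function below takes the column pairs and the count rows and adds a label column for the documents. The name write_matrix, the "PHRASE/TAG" header format, the doc1, doc2, ... labels, and the sample values are all assumptions for illustration. It writes to an in-memory buffer so the result can be inspected directly:

```python
import csv
import io

def write_matrix(out, columns, rows):
    """Write a document-by-pair count matrix as CSV to the file-like object `out`."""
    writer = csv.writer(out)
    # Header: a label column, then one "PHRASE/TAG" column per pair.
    writer.writerow(['document'] + ['%s/%s' % pair for pair in columns])
    for i, row in enumerate(rows, start=1):
        writer.writerow(['doc%d' % i] + list(row))

# Hypothetical sample matrix: two pairs, two documents.
columns = [('NP', 'NN'), ('PP', 'IN')]
rows = [[2, 0], [1, 1]]

buffer = io.StringIO()
write_matrix(buffer, columns, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file that Excel can open, pass a real file instead of the buffer, opened with open('myfile.csv', 'w', newline='').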