I have many documents, each containing a string like the one below.
[('ADVP', 'RB'), ('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'NN'),
 ('PP', 'TO'), ('NP', 'PRP'), ('NP', 'RB'), ('NP', 'CD'), ('NP', 'JJ'),
 ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NNS'), ('ADVP', 'RB'), ('NP', 'PRP'),
 ('PP', 'IN'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'NN'), ('NP', 'DT'),
 ('NP', 'NN'), ('ADVP', 'RB'), ('NP', 'DT'), ('NP', 'JJ'), ('NP', 'NN'),
 ('WHNP', 'WDT'), ('NP', 'JJS'), ('NP', 'CD'), ('NP', 'PRP'), ('VP', 'VBP'),
 ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'PRP'), ('VP', 'VBD'), ('NP', 'DT'),
 ('NP', 'NN'), ('WHADVP', 'WRB'), ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'RB'),
 ('NP', 'DT'), ('NP', 'NNS'), ('PRT', 'RP'), ('NP', 'PRP'), ('ADVP', 'RB'),
 ('NP', 'DT'), ('NP', 'NN'), ('NP', 'PRP'), ('PP', 'IN'), ('NP', 'NN'),
 ('PP', 'IN'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NN')]
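I want to create a matrix in Excel in which each unique syntactic category pair, such as ('ADVP', 'RB'), ('NP', 'NN'), ('NP', 'DT'), serves as a column header for its frequency.
Second, a third document may contain category pairs that do not occur in the earlier documents. Any such new pair must be appended to the column headers.
Finally, I want a matrix whose columns are the syntactic pairs and whose rows are the documents. Each entry Mij should give the frequency of the j-th syntactic pair in the i-th document.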
Answer 0 (score: 0)
You can use the collections module to count how often each pair occurs:
import collections
doc1 = [('ADVP', 'RB'), ('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'NN'), ('PP','TO'), ('NP', 'PRP'), ('NP', 'RB'), ('NP', 'CD'), ('NP', 'JJ'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NNS'), ('ADVP', 'RB'), ('NP', 'PRP'), ('PP', 'IN'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN'), ('ADVP', 'RB'), ('NP', 'DT'), ('NP', 'JJ'), ('NP', 'NN'), ('WHNP', 'WDT'), ('NP', 'JJS'), ('NP', 'CD'), ('NP', 'PRP'), ('VP', 'VBP'), ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'PRP'),('VP', 'VBD'), ('NP', 'DT'), ('NP', 'NN'), ('WHADVP', 'WRB'), ('NP', 'DT'), ('NP', 'NNS'), ('NP', 'RB'), ('NP', 'DT'), ('NP', 'NNS'), ('PRT', 'RP'), ('NP', 'PRP'), ('ADVP', 'RB'), ('NP', 'DT'), ('NP', 'NN'), ('NP', 'PRP'), ('PP', 'IN'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NN'), ('PP', 'IN'), ('NP', 'NN')]
count1 = collections.Counter(doc1)
This gives you:
count1.keys()
>>>[('PP', 'IN'), ('WHADVP', 'WRB'), ('NP', 'NNS'), ('WHNP', 'WDT'), ('NP', 'NN'), ('NP', 'JJS'), ('NP', 'DT'), ('NP', 'CD'), ('ADVP', 'RB'), ('PRT', 'RP'), ('VP', 'VBD'), ('NP', 'JJ'), ('NP', 'RB'), ('VP', 'VBP'), ('NP', 'PRP'), ('PP', 'TO')]
count1.values()
>>>[5, 1, 4, 1, 13, 1, 9, 2, 4, 1, 1, 2, 2, 1, 6, 1]
Do this for every document.
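Counting every document can be done in a loop rather than one variable at a time. The following is a minimal sketch, assuming the documents are already loaded as lists of tuples; the `docs` list and its toy contents are hypothetical and only stand in for real data:

```python
import collections

# Hypothetical toy documents; in practice each list would be read from a file.
docs = [
    [('NP', 'NN'), ('NP', 'DT'), ('NP', 'NN')],
    [('PP', 'IN'), ('NP', 'NN')],
    [('VP', 'VBD'), ('PP', 'IN'), ('PP', 'IN')],
]

# One Counter per document, kept in document order.
counts = [collections.Counter(doc) for doc in docs]

print(counts[0][('NP', 'NN')])   # 2
print(counts[1][('VP', 'VBD')])  # 0, because a Counter returns 0 for a missing key
```

Because a Counter returns 0 for keys it has never seen, a pair that is absent from a document needs no special handling later on.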
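Next, convert each count into a list of three values, one slot per document. NumPy arrays are easier to work with here, because they add element-wise.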
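import numpy as np
# pairs1, pairs2 and pairs3 are the Counters of documents 1, 2 and 3 (count1 above is pairs1)
for key in pairs1.keys():
    pairs1[key] = np.array([pairs1[key], 0, 0])  # count goes in slot 0
for key in pairs2.keys():
    pairs2[key] = np.array([0, pairs2[key], 0])  # count goes in slot 1
for key in pairs3.keys():
    pairs3[key] = np.array([0, 0, pairs3[key]])  # count goes in slot 2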
Then merge the three dictionaries into one:
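pairs = {}
# start from the first document's arrays
for key in pairs1.keys():
    pairs[key] = pairs1[key]
# add the second document; pairs not seen so far are appended
for key in pairs2.keys():
    try:
        pairs[key] = pairs[key] + pairs2[key]
    except KeyError:
        pairs[key] = pairs2[key]
# add the third document in the same way
for key in pairs3.keys():
    try:
        pairs[key] = pairs[key] + pairs3[key]
    except KeyError:
        pairs[key] = pairs3[key]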
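The same merge can be written for any number of documents by first collecting the union of all pairs and then reading each count off the Counters. This is only a sketch of that idea using plain lists instead of NumPy; the names `counts`, `columns` and `matrix` and the toy counts are assumptions made for illustration:

```python
import collections

# Hypothetical per-document counts, one Counter per document.
counts = [
    collections.Counter({('NP', 'NN'): 2, ('NP', 'DT'): 1}),
    collections.Counter({('PP', 'IN'): 1, ('NP', 'NN'): 1}),
    collections.Counter({('VP', 'VBD'): 1, ('PP', 'IN'): 2}),
]

# Union of all pairs in first-seen order; pairs that first appear in a
# later document are appended to the end of the column list.
columns = list(dict.fromkeys(key for c in counts for key in c))

# matrix[i][j] is the frequency of the j-th pair in the i-th document.
matrix = [[c[key] for key in columns] for c in counts]

print(columns)
print(matrix)
```

Here each row of `matrix` is one document and each column is one pair, which matches the Mij layout asked for in the question.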
Finally, you can write out your matrix:
keys = list(pairs.keys())  # fix the column order once
with open('myfile.csv', 'w') as f:
    # header row: one column per pair, written as CAT/TAG so that the
    # comma inside the tuple does not split the column
    f.write(', '.join('%s/%s' % key for key in keys) + '\n')
    # one row per document (three documents, so indices 0 to 2)
    for i in range(3):
        f.write(', '.join(str(pairs[key][i]) for key in keys) + '\n')
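As an alternative to formatting the lines by hand, the standard library's csv module can write the rows and takes care of quoting. The sketch below assumes the column list and the matrix have already been built; `columns`, `matrix`, the `document` header and the `doc1`, `doc2` row labels are hypothetical, and an in-memory buffer stands in for the real file:

```python
import csv
import io

# Hypothetical inputs: the column pairs and one row of counts per document.
columns = [('NP', 'NN'), ('PP', 'IN')]
matrix = [[2, 0], [1, 1]]

# io.StringIO stands in for open('myfile.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)

# Header row: a label column followed by one CAT/TAG column per pair.
writer.writerow(['document'] + ['%s/%s' % pair for pair in columns])

# One row per document, labelled doc1, doc2, ...
for i, row in enumerate(matrix, start=1):
    writer.writerow(['doc%d' % i] + row)

print(buffer.getvalue())
```

The resulting CSV file can be opened directly in Excel, with the pairs as column headers and one row per document.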