I have a text corpus: it comes from a file containing assorted sentences and paragraphs.
Here is my code:
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from collections import Counter

with open("descriptionsample.tsv", "r") as openfile:
    frequency = Counter()
    stopwords = nltk.corpus.stopwords.words('english')
    tokenizer = RegexpTokenizer(r"[\w’]+", flags=re.UNICODE)
    for line in openfile:
        words = line.lower().strip()
        # Strip digits and punctuation, then turn hyphens into spaces
        words = re.sub(r'[0-9~`@#$%^&*()_+={}\[\]\\<,.>?/;:]', '', words).replace('-', ' ')
        tokens = tokenizer.tokenize(words)
        tokens = [token for token in tokens if token not in stopwords]
        frequency.update(tokens)
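(Side note: `frequency` above accumulates over the entire file; per-line counts like the ones I describe next take a fresh `Counter` per line. A minimal sketch with made-up tokens:)

```python
from collections import Counter

# A fresh Counter for each line instead of one accumulated over the
# whole file (the token list here is illustrative)
tokens = ['code'] * 10 + ['sql'] * 3 + ['python'] * 2
line_frequency = Counter(tokens)

print(line_frequency)  # Counter({'code': 10, 'sql': 3, 'python': 2})
```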
My results come back in Counter format:
{'code':32344,'sql':2123,'chicago':1233...........} etc.
But suppose the word-frequency result for just the first line of the document were:
{'code':10,'sql':3,'python':2........}
What I want to do is build a per-document co-occurrence matrix keyed by tuples (as opposed to bi-grams/tri-grams), then collect the sums at the end. Basically, for every pair of keys I create a tuple (key1, key2) whose value is key2's count, where key2 can even be key1 itself.
So, after counting the word frequencies in each line of the tsv file, I'd like the per-line result to look like this:
{('code','code'):10,('code','sql'):3,('code','python'):2,('sql','code'):10,('sql','sql'):3,('sql','python'):2,('python','code'):10,('python','sql'):3,('python','python'):2}
I can't wrap my head around it. Any help? Maybe there's a library out there I'm overlooking.
Answer 0 (score: 0)
A colleague found the answer for me. I had initially tried layers upon layers of nested dictionaries, but iterating over them was a nightmare. This turned out to be much simpler and more efficient for my problem:
doc2 = {
    'a': 1,
    'b': 2,
    'c': 3,
    'd': 4,
    'e': 5
}

res = {}
for key1 in doc2.keys():
    for key2 in doc2.keys():
        if key1 != key2:
            res[(key1, key2)] = doc2[key2]

for key in res:
    print("[{}, {}] = {}".format(key[0], key[1], res[key]))
Result:
[b, c] = 3
[d, a] = 1
[b, a] = 1
[d, c] = 3
[e, d] = 4
[c, d] = 4
[d, e] = 5
[c, e] = 5
[e, c] = 3
[c, a] = 1
[a, d] = 4
[e, b] = 2
[a, e] = 5
[d, b] = 2
[c, b] = 2
[a, b] = 2
[e, a] = 1
[b, e] = 5
[a, c] = 3
[b, d] = 4
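One detail worth noting: the loop above skips key1 == key2, while the desired output in the question includes diagonal pairs such as ('code','code'). A sketch that keeps the diagonal and also sums the per-line dictionaries at the end, using itertools.product (the per-line counts here are hypothetical):

```python
from collections import Counter
from itertools import product

# Hypothetical per-line word counts (one dict per tsv line)
per_line = [
    {'code': 10, 'sql': 3, 'python': 2},
    {'code': 5, 'sql': 1},
]

total = Counter()
for counts in per_line:
    # product(..., repeat=2) pairs every key with every key, diagonal included;
    # each pair (key1, key2) maps to key2's count, and Counter.update sums them
    total.update({(k1, k2): counts[k2] for k1, k2 in product(counts, repeat=2)})

print(total[('code', 'code')])  # 15
print(total[('code', 'sql')])   # 4
```

Using a Counter for the running total means the per-line dictionaries are added key-wise for free, which covers the "collect the sums at the end" step without any nested dictionaries.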