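I have a somewhat complicated counting problem. I am trying to build a co-occurrence data frame whose rows are the words in a corpus and whose columns are those same words plus a set of features (e.g. plural, singular, past tense, etc.).
I have already built the relevant dictionary. Each word maps to a dictionary in which every key is a word or a feature. Like this: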
WordDict={Word1 :{word1:0, word2:0 ... feature1:0, feature2:0 ...}, Word2 :{word1:0, word2:0 ... feature1:0, feature2:0 ...} ...}
I also have a corpus (lemmatized):
doc=['Word1', 'Word2', 'Word3' ...]
I also have a list of the tokens together with their features:
meh=[['Word1', 'Feature1', 'Feature2', 'Feature3'], ['Word2', 'Feature1', 'Feature2', 'Feature3', 'Feature4' ], ['Word3', 'Feature1', 'Feature3']]
Ideally, what I want is a dictionary that looks like this:
WordDict={Word1:{word1:0, word2:1 ... feature1:1, feature2:1 ...}, Word2:{word1:1, word2:0 ... feature1:1, feature2:1 ...} ...}
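Because the words are lemmas, some of them will be repeated in doc, but each will have only one entry in WordDict. Basically, I need to:
1. Iterate over meh for each top-level key in WordDict.
1a. For each feature observed for that key in the meh list, add 1 to the relevant feature count in WordDict.
2. Iterate over doc for each top-level key in WordDict.
2a. For each word seen within 5 units to the left or right, add 1 to the relevant word count in WordDict.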
I have considered using some kind of n-gram window:
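def windower(tokens, n):
    # Yield each token with the tokens up to n positions to its left and right.
    for count, ele in enumerate(tokens):
        start = max(0, count - n)  # clamp so a negative index does not wrap around
        # the right-hand slice ends at count + n + 1 so that n tokens are included
        window = tokens[start:count] + tokens[count + 1:count + n + 1]
        yield ele, window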
So, to count word co-occurrences from here, I think I need a way to add the words that appear in window to the relevant word key in WordDict.
I hope someone can help.
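Answer 0 (score: 0)
I wrote the following code based on your description.
However, steps 2 and 2a look odd to me, so I suspect this code is not quite what you want.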
wordDict = {
    "word1": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
    "word2": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
    "word3": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
}
# some will be repeated you say?
doc = ["word1", "word1", "word2", "word3"]
meh = [["word1", "feature2", "feature3"], ["word2", "feature2"], ["word3", "feature1"]]
for word, wf in wordDict.items():
    # 1a starts
    found = False
    for m in meh:
        if m[0] == word:
            found = True
            for f in m[1:]:
                wf[f] += 1
        if found:
            break
    # 1a ends
    # 2a starts
    # note: this counts the neighbours of every token in doc, not only of `word`,
    # so every top-level key ends up with the same word counts
    docLen = len(doc)
    for i, d in enumerate(doc):
        # 5 to the left, excluding itself
        for j in range(max(0, i - 5), i):
            wf[doc[j]] += 1
        # 5 to the right, excluding itself
        for j in range(i + 1, min(i + 6, docLen)):
            wf[doc[j]] += 1
    # 2a ends
print(wordDict)
# {'word1': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 0, 'feature2': 1, 'feature3': 1}, 'word2': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 0, 'feature2': 1, 'feature3': 0}, 'word3': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 1, 'feature2': 0, 'feature3': 0}}
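If step 2a is read differently, so that a word's counts only grow from the neighbours of that word's own occurrences in doc, the counting can be done in a single pass. The following is a minimal sketch under that assumption, not necessarily what the question intends. The function name count_cooccurrences and its window parameter are made up for this illustration, and it uses collections.Counter instead of a pre-filled dictionary, so a missing key simply reads as 0.

```python
from collections import Counter, defaultdict


def count_cooccurrences(doc, meh, window=5):
    """Return {word: Counter} with feature counts and neighbour counts.

    Hypothetical helper written for this answer; assumes each row of `meh`
    is [token, feature, feature, ...] and `doc` is a list of lemmas.
    """
    counts = defaultdict(Counter)

    # Steps 1 and 1a: add 1 for every feature listed for a token in meh.
    for token, *features in meh:
        counts[token].update(features)

    # Steps 2 and 2a (assumed reading): for each occurrence of a word in doc,
    # add 1 for every neighbour within `window` positions on either side.
    for i, word in enumerate(doc):
        left = doc[max(0, i - window):i]
        right = doc[i + 1:i + 1 + window]
        counts[word].update(left + right)

    return dict(counts)


doc = ["word1", "word1", "word2", "word3"]
meh = [["word1", "feature2", "feature3"], ["word2", "feature2"], ["word3", "feature1"]]
result = count_cooccurrences(doc, meh)
print(result["word1"]["word2"])  # 2
```

With this reading, the two occurrences of word1 each see word2 once, so result["word1"]["word2"] is 2, whereas the loop above gives every top-level key identical word counts.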