Counting words within a window, plus features

Time: 2019-08-07 17:43:27

Tags: python

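I have a somewhat complicated counting problem. I'm trying to build a co-occurrence dataframe whose rows are the words in a corpus and whose columns are those same words plus a set of features (e.g. plural, singular, past tense).

I've already built the dictionary for this. Each word maps to its own dictionary, in which every key is either a word or a feature. Like this: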

WordDict={Word1 :{word1:0, word2:0 ... feature1:0, feature2:0 ...}, Word2 :{word1:0, word2:0 ... feature1:0, feature2:0 ...} ...}

I also have a corpus (lemmatized):

doc=['Word1', 'Word2', 'Word3' ...]

I also have a list of tokens together with their features:

meh=[['Word1', 'Feature1', 'Feature2', 'Feature3'], ['Word2', 'Feature1', 'Feature2', 'Feature3', 'Feature4' ], ['Word3', 'Feature1', 'Feature3']]
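For reference, a zero-filled nested dictionary of this shape could be built directly from doc and meh. This is only a minimal sketch: `build_word_dict` is a hypothetical helper name, and it assumes the keys are exactly the strings that appear in doc and meh (same spelling and case).

```python
def build_word_dict(doc, meh):
    """Return {word: {column: 0}} with one column per word and per feature."""
    # Unique lemmas in order of first appearance (dicts preserve insertion order).
    words = list(dict.fromkeys(doc))
    # Unique features: everything after the token in each meh entry.
    features = list(dict.fromkeys(f for entry in meh for f in entry[1:]))
    columns = words + features
    # A fresh inner dict per word, so rows never share state.
    return {word: dict.fromkeys(columns, 0) for word in words}


# Illustrative data only (assumption, not the real corpus).
doc = ["Word1", "Word2", "Word1", "Word3"]
meh = [
    ["Word1", "Feature1", "Feature2"],
    ["Word2", "Feature1"],
    ["Word3", "Feature2"],
]
word_dict = build_word_dict(doc, meh)
print(word_dict["Word1"])
# {'Word1': 0, 'Word2': 0, 'Word3': 0, 'Feature1': 0, 'Feature2': 0}
```

Repeated lemmas in doc collapse to a single row, which matches the requirement that each word has only one entry.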

Ideally, what I want is a dictionary that looks like this:

WordDict={Word1:{word1:0, word2:1 ... feature1:1, feature2:1 ...}, Word2:{word1:1, word2:0 ... feature1:1, feature2:1 ...} ...}

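Because the words are lemmas, some of them will be repeated in doc, but each has only one entry in WordDict. Basically I need to:

  1. For each top-level key in WordDict, go through meh.

    1a. For each feature listed in meh for that top-level key, add 1 to the corresponding feature count in WordDict.

  2. For each top-level key in WordDict, go through doc.

    2a. For each word seen within 5 positions to its left or right, add 1 to the corresponding word count in WordDict.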
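Steps 1 and 1a could be sketched roughly as follows. `add_feature_counts` is a hypothetical name, and the sketch assumes each meh entry has the form [token, feature, feature, ...] and that tokens without a row in the dictionary should simply be skipped.

```python
def add_feature_counts(word_dict, meh):
    """Add 1 to word_dict[token][feature] for every feature listed in meh."""
    for token, *features in meh:
        row = word_dict.get(token)
        if row is None:
            continue  # assumption: tokens with no row are ignored
        for feature in features:
            # .get(..., 0) tolerates a feature column that was not pre-created
            row[feature] = row.get(feature, 0) + 1
    return word_dict


# Illustrative data only.
word_dict = {
    "Word1": {"Word1": 0, "Word2": 0, "Feature1": 0, "Feature2": 0},
    "Word2": {"Word1": 0, "Word2": 0, "Feature1": 0, "Feature2": 0},
}
meh = [["Word1", "Feature1", "Feature2"], ["Word2", "Feature2"]]
add_feature_counts(word_dict, meh)
print(word_dict["Word1"])
# {'Word1': 0, 'Word2': 0, 'Feature1': 1, 'Feature2': 1}
```

Iterating over meh and looking each token up in the dictionary avoids scanning the whole meh list once per word.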

I've considered using some kind of n-gram window:

def windower(tokens, n):
    # Yield each word together with its window: up to n tokens on either side.
    for count, ele in enumerate(tokens):
        window = tokens[max(0, count - n):count] + tokens[count + 1:count + n + 1]
        yield ele, window

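So I think that, to count word co-occurrences from here, I need a way to add the words that appear in window to the entry for the relevant word in WordDict.

Can anyone help?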

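1 answer:

Answer 0: (score: 0)

I wrote the following code based on your description.

However, steps 2 and 2a look odd to me, so I suspect the code is not what you actually want.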

wordDict = {
    "word1": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
    "word2": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
    "word3": {
        "word1": 0,
        "word2": 0,
        "word3": 0,
        "feature1": 0,
        "feature2": 0,
        "feature3": 0,
    },
}

# some will be repeated you say?
doc = ["word1", "word1", "word2", "word3"]

meh = [["word1", "feature2", "feature3"], ["word2", "feature2"], ["word3", "feature1"]]

for word, wf in wordDict.items():

    # 1a starts
    found = False
    for m in meh:
        if m[0] == word:
            found = True

            for f in m[1:]:
                wf[f] += 1

        if found:
            break
    # 1a ends

    # 2a starts
    docLen = len(doc)
    for i, d in enumerate(doc):
        # 5 to the left, excluding itself
        for j in range(max(0, i - 5), i):
            wf[doc[j]] += 1
        # 5 to the right, excluding itself
        for j in range(i + 1, min(i + 6, docLen)):
            wf[doc[j]] += 1
    # 2a ends


print(wordDict)

# {'word1': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 0, 'feature2': 1, 'feature3': 1}, 'word2': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 0, 'feature2': 1, 'feature3': 0}, 'word3': {'word1': 6, 'word2': 3, 'word3': 3, 'feature1': 1, 'feature2': 0, 'feature3': 0}}
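
In the output above every row receives the same word counts, because the window loop does not depend on the outer word. If the intent of steps 2 and 2a is instead that only windows centred on an occurrence of a word contribute to that word's row, one possible variant is sketched below. That reading is an assumption, not something the question confirms, and `add_window_counts` is a hypothetical name.

```python
def add_window_counts(word_dict, doc, n=5):
    """For each position in doc, credit its neighbours to that word's row only."""
    for i, word in enumerate(doc):
        row = word_dict[word]
        # Up to n tokens on each side, excluding position i itself.
        neighbours = doc[max(0, i - n):i] + doc[i + 1:i + n + 1]
        for other in neighbours:
            row[other] += 1
    return word_dict


doc = ["word1", "word1", "word2", "word3"]
columns = ["word1", "word2", "word3"]
word_dict = {w: dict.fromkeys(columns, 0) for w in columns}
add_window_counts(word_dict, doc)
print(word_dict)
# {'word1': {'word1': 2, 'word2': 2, 'word3': 2},
#  'word2': {'word1': 2, 'word2': 0, 'word3': 1},
#  'word3': {'word1': 2, 'word2': 1, 'word3': 0}}
```

With this reading the counts are symmetric between any two distinct words, and a word only co-occurs with itself when it is repeated inside its own window, as word1 is here.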