Question

I'm analyzing a small corpus and want to build a dictionary from 500,000 text files.

The text files consist of numbered lines, each made up of tab-separated columns containing strings (or numbers), i.e.:

1    string1    string2    string3    # ...and so on, but I only need columns 2-4
2    string1    string2    string3    # ...and so on...
3    string1    string2    string3    # ...and so on...
4    string1    string2    string3    # ...and so on...
# ...and so on...

I'm simplifying here: the words are not the same on every line, but they do repeat throughout the corpus.
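For reference, a minimal sketch of pulling columns 2-4 out of one such line. It assumes the columns really are separated by single tab characters; `parse_line` and the sample strings are hypothetical names and data used only for illustration. Splitting on `"\t"` rather than on any whitespace keeps a column intact even if it contains spaces.

```python
def parse_line(line):
    """Return (token, lemma, category) from a tab-separated line, or None."""
    cols = line.rstrip("\n").split("\t")
    # Keep only numbered lines that have at least four columns
    if len(cols) < 4 or not cols[0].isdigit():
        return None
    return cols[1], cols[2], cols[3]


# Hypothetical sample lines, not taken from the real corpus
print(parse_line("1\tNew York\tNew York\tPROPN\textra\n"))  # ('New York', 'New York', 'PROPN')
print(parse_line("# not a numbered line\n"))                # None
```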

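I want to create a dictionary with the second column ("string1") as the key and the third and fourth columns as that key's values, plus the total number of occurrences of that key across the whole corpus.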

It should look like this:

my_dict = {
    "string1": [99, "string2", "string3"],
    "string1": [51, "string2", "string3"],
    # ...and so on...
}

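So "string1" stands for the token, the number is the counter for that token, "string2" stands for the lemma, and "string3" stands for the category (some of which need to be omitted, as shown in the code below).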

I've managed (with a lot of help from Stack Overflow) to write some code:

import os
import re
import operator

test_paths = ["path1", "path2", "path3"]
cat_to_omit = ["CAT1", "CAT2"]
tokens = {}

for path in test_paths:
    dirs = os.listdir(path)
    for file in dirs:
        file_path = os.path.join(path, file)  # plain concatenation would drop the path separator
        with open(file_path) as f:
            for line in f:
                if re.match(r"^\d+.*", line): #selecting only lines starting with numbers, because some of them don't, and I don't need these
                    check = line.split()[3]
                    if check not in cat_to_omit: #omitting some categories that I don't need
                        token_lst = line.lower().split()[1]
                        for token in token_lst.split():
                            tokens[token] = tokens.get(token, 0) + 1
print(tokens)

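Right now I only get "string1" (the token) as the key, with a counter as the value showing how many times that token occurs in my corpus. How can I store a list of 3 values for each key (token):

1. the counter, which I already have as the only value for each key,
2. the lemma, which should be taken from column 3 ("string2"),
3. the category, which should be taken from column 4 ("string3")?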

It seems I just don't understand how to turn my "key: value" dictionary into a "key: 3 values" dictionary.
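One possible approach, sketched under assumptions: store a list `[count, lemma, category]` the first time a token is seen, and afterwards only increment the first element. The helper name `add_line` and the sample lines are hypothetical. The sketch also assumes that a token always maps to the same lemma and category, so the first pair seen is kept.

```python
import re

CAT_TO_OMIT = {"CAT1", "CAT2"}  # same placeholder categories as in the question's code


def add_line(tokens, line):
    """Update tokens[token] = [count, lemma, category] from one corpus line."""
    if not re.match(r"^\d+", line):   # skip lines that do not start with a number
        return
    cols = line.split()
    if len(cols) < 4:                 # skip lines with too few columns
        return
    token, lemma, category = cols[1].lower(), cols[2], cols[3]
    if category in CAT_TO_OMIT:       # skip unwanted categories
        return
    if token in tokens:
        tokens[token][0] += 1         # key already seen: only bump the counter
    else:
        tokens[token] = [1, lemma, category]


# Hypothetical sample lines standing in for the real files
sample = [
    "1\tDogs\tdog\tNOUN",
    "2\tbark\tbark\tVERB",
    "3\tdogs\tdog\tNOUN",
    "4\tthe\tthe\tCAT1",
    "# comment line",
]
tokens = {}
for line in sample:
    add_line(tokens, line)
print(tokens)  # {'dogs': [2, 'dog', 'NOUN'], 'bark': [1, 'bark', 'VERB']}
```

If the same token can occur with different lemmas or categories, keying the dictionary on the tuple `(token, lemma, category)` instead of the token alone would keep those counts separate.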

Adding selected columns of a file as values to a dictionary

0 answers: