Question

My vocabulary is in the form dic = {'a': '', ...., ....}, where the key is a word and the value is its word count.

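I have some sentences, such as:

"This is a test"

"an apple"

...

To tokenize the sentences, each sentence should be encoded as the dictionary indices of its words. If a word in the sentence also exists in the dictionary, take that word's index; otherwise set the value to 0.

For example, I set the sentence dimension to 6; if a sentence has fewer than 6 words, it is padded with 0s to make it 6-dimensional.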

"This is a test" ----> [2,0,2,4,0,0]

"an apple" ----> [5,0,0,0,0,0]

Here is my sample code:

words=['the','a','an'] #use this list as my dictionary
X=[]

with open('test.csv','r') as infile:
    for line in infile:
        for word in line:
            if word in words:
                X.append(words.index(word))
            else: X.append(0)

Something is wrong with my code, because the output is incorrect; also, I don't know how to set the sentence dimension or how to pad.

Answer 1

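There are a few problems with your code:

You are not tokenizing by word but by character. You need to split each line into words.
You are appending to one big list rather than to a list of lists, with one inner list per sentence/line.
As you said, you are not limiting the size of the list.
I also don't understand why you are using a list as the dictionary.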
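To illustrate the first and last points, here is a small sketch. The sample line and the name `word_to_index` are made up for illustration, and numbering the words from 1 is my own assumption rather than something the question asks for. It shows that iterating over a string yields characters, and that `list.index` returns 0 for the first entry, which collides with the 0 used for words that are not in the dictionary.

```python
line = "the apple\n"  # hypothetical line, as it would be read from a file

# Iterating over a string yields single characters, not words.
chars = [ch for ch in line]

# split() with no argument splits on whitespace and drops the newline.
tokens = line.split()

words = ['the', 'a', 'an']
# 'the' sits at position 0, the same value used for "not in dictionary".
first_index = words.index('the')

# One possible workaround (an assumption, not part of the question):
# number the words from 1 so that 0 stays reserved for unknown/padding.
word_to_index = {w: i for i, w in enumerate(words, start=1)}

print(chars[:3])      # ['t', 'h', 'e']
print(tokens)         # ['the', 'apple']
print(first_index)    # 0
print(word_to_index)  # {'the': 1, 'a': 2, 'an': 3}
```

A dict lookup is also constant time on average, whereas `words.index(word)` scans the list on every call.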

I have edited your code below; I think it matches your specification more closely:

words={'the': 2,'a': 1,'an': 3}
X=[]

with open('test.csv','r') as infile:
    for line in infile:
        # Inits the sublist to [0, 0, 0, 0, 0, 0]
        sub_X = [0] * 6

        # Enumerates each word in the list with an index
        # split() splits a string by whitespace if no arg is given
        for idx, word in enumerate(line.split()):
            if word in words:
                # Check if the idx is within bounds before accessing
                if idx < 6:
                    sub_X[idx] = words[word]

        # X represents the overall list and sub_X the sentence
        X.append(sub_X)
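
If you want to try the logic without a file, the per-line work could be pulled into a function along these lines. This is only a sketch: `encode_sentence` and `max_len` are names I made up, and the sample sentences are taken from the question in lowercase (the lookup is case-sensitive, so "This" would not match a lowercase key).

```python
def encode_sentence(sentence, vocab, max_len=6):
    """Encode a sentence as a fixed-length list of word indices."""
    # Keep at most max_len words and look each one up (0 if unknown).
    encoded = [vocab.get(word, 0) for word in sentence.split()[:max_len]]
    # Pad with zeros on the right up to max_len.
    return encoded + [0] * (max_len - len(encoded))


words = {'the': 2, 'a': 1, 'an': 3}  # same dictionary as in the code above

print(encode_sentence("this is a test", words))   # [0, 0, 1, 0, 0, 0]
print(encode_sentence("an apple", words))         # [3, 0, 0, 0, 0, 0]
print(encode_sentence("a a a a a a a a", words))  # [1, 1, 1, 1, 1, 1]
```

Slicing with `[:max_len]` handles truncation and the final concatenation handles padding, so no explicit bounds check is needed.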

Python: tokenizing sentences according to the word indices of a dictionary
