Question

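I'm having trouble creating a loop over a dict. I have a dictionary: the keys are unique numbers and the values are words. I need to create a matrix in which the rows are sentence numbers and the columns are the words' unique numbers (from the dictionary). Each matrix element should show how many times a given word occurs in a given sentence. Here is my code for building the dict. (I start with a raw text file of sentences.)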

with open ('sentences.txt', 'r') as file_obj:
    lines=[]
    for line in file_obj:
        line_split=re.split('[^a-z]',line.lower().strip()
        j=0
        new_line=[]
        while j<=len(line_split)-1:
            if (line_split[j]):
                new_line.append(line_split[j])
            j+=1            
        lines.append(new_line)    
    vocab = {}
    k = 1
    for i in range(len(lines)):
        for j in range(len(lines[i])):
            if lines[i][j] not in vocab.values():
                vocab[k]=lines[i][j]
                k+=1

import numpy as np  //now I am trying to create a matrix
matr = np.array(np.zeros((len(lines),len(vocab))))  
m=0
l=0
while l<22:
    for f in range (len(lines[l])):
        if vocab[1]==lines[l][f]:   //this works only for the 1 word in dict
            matr[l][0]+=1
    l+=1
print(matr[3][0])

matr = np.array(np.zeros((len(lines),len(vocab))))   // this also works
for values in range (len(vocab)):
    for line in lines:
        a=line.count(vocab[1])
        print(a)

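But when I try to write a loop that goes through the whole dictionary, nothing works! Can you tell me how to fill the whole matrix? Thank you very much in advance!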

Answer 1

A few careless errors: the re.split line is missing its closing parenthesis, and // is not Python comment syntax (use # instead).

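Looking at your code, I can't tell what your general algorithm is, other than building a basic word-count dictionary. So I propose this shorter code: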

import re
import sys

def get_vocabulary (filename):
  vocab_dict = {}

  with open (filename, 'r') as file_obj:
    for line in file_obj:
      for word in re.findall(r'[a-z]+',line.lower()):
        if word in vocab_dict:   # see below for an interesting alternative
          vocab_dict[word] += 1
        else:
          vocab_dict[word] = 1
  return vocab_dict

if len(sys.argv) > 1:
  vocab = get_vocabulary (sys.argv[1])
  for word in vocab:
    print (word, '->', str(vocab[word]))

Note that I have replaced your own

line_split=re.split('[^a-z]',line.lower().strip())

with

re.findall(r'[a-z]+',line.lower())

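because yours can return empty elements, whereas mine cannot. Initially I had to add the test if word: before inserting into the dictionary, to avoid adding lots of blanks. With a stricter definition of a 'word', that test is no longer needed.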
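To make the difference concrete, here is a small sketch comparing the two calls on a made-up sample line (the variable names are only for illustration):

```python
import re

line = "Hello, world!"  # made-up sample line

# Splitting on every non-letter leaves empty strings between
# adjacent separators and after a trailing one.
split_words = re.split('[^a-z]', line.lower().strip())

# Matching runs of letters returns only the words themselves.
found_words = re.findall(r'[a-z]+', line.lower())

print(split_words)
print(found_words)
```

The split result contains empty strings that would have to be filtered out, while the findall result contains only the words.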

(Python fun: an alternative to the if..else looks like this single line:

vocab_dict[word] = 1 if word not in vocab_dict else vocab_dict[word]+1

It is slightly less efficient, because vocab_dict[word] has to be looked up twice; you can't just say .. + 1 on its own. Still, it is a nice line to read.)
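If the repeated lookup bothers you, two standard-library options may be worth a look: dict.get with a default of 0, or collections.Counter. This is only a sketch on a made-up word list, not a drop-in replacement for the function above:

```python
from collections import Counter

words = ["a", "b", "a", "c", "a"]  # made-up sample

# dict.get returns the default (0) when the key is missing,
# so the count is updated in a single statement.
vocab_dict = {}
for word in words:
    vocab_dict[word] = vocab_dict.get(word, 0) + 1

# Counter does the same bookkeeping internally.
counts = Counter(words)

print(vocab_dict)
print(dict(counts) == vocab_dict)
```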

The dictionary can be converted into a 'matrix' (really, a simple array is enough) with

matrix = [[vocab[word], word] for word in sorted(vocab)]
for row in matrix:
  print (row)
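For the matrix the question actually asks about (one row per sentence, one column per vocabulary word, each cell holding that word's count in that sentence), a minimal sketch might look like the following. The names tokenize and count_matrix and the sample sentences are made up for illustration, and it uses plain lists rather than NumPy so that it runs without extra dependencies.

```python
import re

def tokenize(line):
    # Same word rule as above: runs of lowercase letters only.
    return re.findall(r'[a-z]+', line.lower())

def count_matrix(sentences):
    # Hypothetical helper: returns (list of words, matrix of counts).
    lines = [tokenize(s) for s in sentences]
    # Map each distinct word to a column index, in order of first appearance.
    column = {}
    for words in lines:
        for word in words:
            if word not in column:
                column[word] = len(column)
    # One row per sentence, one column per word, initialised to zero.
    matrix = [[0] * len(column) for _ in lines]
    for row, words in enumerate(lines):
        for word in words:
            matrix[row][column[word]] += 1
    return list(column), matrix

sample = ["The cat sat.", "The dog saw the cat!"]  # made-up input
vocab_words, matrix = count_matrix(sample)
print(vocab_words)
for row in matrix:
    print(row)
```

Mapping each word to its column index up front means every word occurrence costs one dictionary lookup, instead of scanning vocab.values() for each word as in the question. If you want a NumPy array afterwards, passing the nested list to np.array should work.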
