Question

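I am splitting words from a text file in Python. I get the number of lines (c) and a dictionary (word_positions) that maps each word to an index. Then I create a zero matrix of shape (c, index). Here is the code: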

from collections import defaultdict
import re
import numpy as np

c=0

f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')

for line in f:
    c = c + 1

word_positions = {}

with open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r') as f:
    index = 0
    for word in re.findall(r'[a-z]+', f.read().lower()):
        if word not in word_positions:
            word_positions[word] = index
            index += 1
print(word_positions)

matrix = np.zeros((c, index))  # np.zeros expects the shape as a single tuple

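My question: how can I fill the matrix to get this result: matrix[c,index] = count, where c is the line number, index is the word's index position, and count is the number of times that word occurs in the line?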

Answer 1

Try the following:

import re
import numpy as np
from itertools import chain

text = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt')

text_list = text.readlines()

c=0

for i in range(len(text_list)):
    c=c+1

text_niz = []

for i in range(len(text_list)):
    text_niz.append(text_list[i].lower()) # convert to lowercase

slovo = []

for j in range(len(text_niz)):
    slovo.append(re.split('[^a-z]', text_niz[j])) # tokenize

for e in range(len(slovo)):

    while slovo[e].count('') != 0:
        slovo[e].remove('') # remove empty strings

slovo_list = list(chain(*slovo))
print(slovo_list) # flat list of all words

slovo_list=list(set(slovo_list)) # remove duplicates
x=len(slovo_list)

s = []

for i in range(len(slovo)):
    for j in range(len(slovo_list)):
        s.append(slovo[i].count(slovo_list[j])) # count each word's occurrences in each sentence

matr = np.array(s) # flat array of word counts per sentence
d = matr.reshape((c, x)) # reshape into a c-by-x matrix (22*254 here)
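
The same counting can be sketched more compactly by reusing the question's word_positions dictionary and filling a NumPy array directly. This is only an illustration: the helper name build_count_matrix and the sample lines are assumptions, not part of the original answer, and the sample list stands in for the lines read from the file.

```python
import re
import numpy as np

def build_count_matrix(lines):
    """Return (matrix, word_positions); matrix[row, col] is how many times
    the word with index col occurs in line number row."""
    # Tokenize each line into lowercase words, as in the question.
    tokenized = [re.findall(r'[a-z]+', line.lower()) for line in lines]

    # Give each distinct word an index in order of first appearance.
    word_positions = {}
    for words in tokenized:
        for word in words:
            if word not in word_positions:
                word_positions[word] = len(word_positions)

    # np.zeros takes the shape as a single tuple.
    matrix = np.zeros((len(tokenized), len(word_positions)), dtype=int)
    for row, words in enumerate(tokenized):
        for word in words:
            matrix[row, word_positions[word]] += 1
    return matrix, word_positions

# Hypothetical sample input standing in for the lines of the text file.
sample_lines = ["The cat sat on the mat.", "The dog chased the cat!"]
matrix, word_positions = build_count_matrix(sample_lines)
print(word_positions)
print(matrix)
```

With a real file, the list of lines could come from f.readlines() inside a with open(...) block.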

Answer 2

It looks like you are trying to create something like an n-dimensional list. These are built by nesting lists inside one another:

two_d_list = [[0, 1], [1, 2], ["example", "blah", "blah blah"]]
words = two_d_list[2]
single_word = two_d_list[2][1]  # Notice the second index operator

This concept is very flexible in Python, and it can also be done with dictionaries nested inside the list:

two_d_list = [{"word":1}, {"example":1, "blah":3}]
words = two_d_list[1]  # type(words) == dict
single_word = two_d_list[1]["example"]  # Similar index operator, but for the dictionary

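This gives you the behavior you want, but without the matrix[c,index] syntax: Python's built-in lists do not support that syntax for indexing. A comma inside square brackets usually separates the elements of a list literal. Instead, you can access an element of a row's dictionary with matrix[c][index] = count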
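
As a rough sketch of the nested approach applied to word counts, the snippet below builds one dictionary per line and then the equivalent list of lists. The sample lines and the names rows, vocabulary, and nested are made up for illustration; Counter from the standard library is used here as a convenient dict subclass.

```python
from collections import Counter
import re

# Hypothetical sample input; in the question these lines come from a file.
sample_lines = ["blah example blah", "word"]

# One Counter (a dict subclass) per line: rows[c][word] is the count.
rows = [Counter(re.findall(r'[a-z]+', line.lower())) for line in sample_lines]
print(rows[0]["blah"])

# The same data as a plain nested list, indexed as nested[c][index].
vocabulary = sorted(set().union(*rows))
nested = [[0] * len(vocabulary) for _ in sample_lines]
for c, counts in enumerate(rows):
    for index, word in enumerate(vocabulary):
        nested[c][index] = counts[word]  # a Counter returns 0 for missing words
print(nested)
```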

You can overload the index operator to get the syntax you want. Here is a question about implementing that syntax. In summary:

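Overload the __getitem__(self, index) function in a wrapper around the list class, and make it accept a tuple (assignment additionally needs __setitem__). A tuple can be created without parentheses, which gives the syntax matrix[c, index] = count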
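
A minimal sketch of such a wrapper might look like the following. The class name Matrix and its constructor arguments are hypothetical, and the class only handles a (row, column) tuple; it does not attempt slicing or bounds reporting.

```python
class Matrix:
    """Hypothetical list-of-lists wrapper supporting matrix[row, col]."""

    def __init__(self, rows, cols, fill=0):
        # Build each row separately so the rows do not share one list object.
        self._data = [[fill] * cols for _ in range(rows)]

    def __getitem__(self, key):
        # matrix[r, c] passes the tuple (r, c) as key.
        row, col = key
        return self._data[row][col]

    def __setitem__(self, key, value):
        # matrix[r, c] = value passes the same tuple plus the value.
        row, col = key
        self._data[row][col] = value

counts = Matrix(2, 3)
counts[1, 2] = 5      # calls __setitem__((1, 2), 5)
counts[0, 0] += 1     # calls __getitem__, then __setitem__
print(counts[1, 2], counts[0, 0])
```

NumPy arrays already accept this tuple form, so with the np.zeros matrix from the question, matrix[c, index] works without any wrapper.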
