Question

I have a list of sentences, where each sentence is already split into words. That is, sentences looks like this:

[['word0', 'word1'], ['word2', 'word3', 'word4', 'word5'], ['word6', 'word7', 'word8'], ....]

Each sentence has its own length, so I find max_sentence_len like this:

max_sentence_len = max(max_sentence_len, len(current_sentence))

I want to build a matrix and an array. Take the sentence ['word2', 'word3', 'word4', 'word5']:

'word2'  ---> 'word3'
'word2' 'word3'  ---> 'word4'
'word2' 'word3' 'word4' ---> 'word5'

So in the matrix and array this becomes:

matrix[0, 0] = 'word2' ---> array[0] = 'word3'
matrix[1, 0] = 'word2', matrix[1, 1] = 'word3' ---> array[1] = 'word4'
....

This has to be done for every sentence!

First, I count how many rows the matrix will have:

summ = 0
for line in sentences:
    summ += len(line)-1

Then I build the matrix and array as explained above:

import numpy as np

train_x = np.zeros([summ, max_sentence_len], dtype=np.int32)
train_y = np.zeros([summ], dtype=np.int32)
ind = 0
for sentence in sentences:
    for i in range(len(sentence)-1):
        # row ind holds the first i+1 words of the sentence, zero-padded on the right
        for j in range(i+1):
            train_x[ind, j] = word2idx(sentence[j])
        # the target is the word that follows this prefix
        train_y[ind] = word2idx(sentence[i+1])
        ind += 1
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)

word2idx just returns the index of a word in the vocabulary.

This works well, but it takes too long (for example, when summ is more than 630000).

Is there any way to do this faster?
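One possible direction, as a minimal sketch and not a measured fix: encode each sentence to indices once through a dictionary, then fill all rows belonging to that sentence with a single lower-triangular block instead of assigning cell by cell. The helper name build_training_arrays and the dictionary word_to_index are hypothetical names introduced here for illustration, and the three sample sentences are only test data.

```python
import numpy as np

def build_training_arrays(sentences, word_to_index, max_sentence_len):
    """Hypothetical helper: one row per (prefix -> next word) pair."""
    # Encode every sentence once instead of looking words up per matrix cell.
    encoded = [np.array([word_to_index[w] for w in s], dtype=np.int32)
               for s in sentences]
    n_rows = sum(max(len(ids) - 1, 0) for ids in encoded)

    train_x = np.zeros((n_rows, max_sentence_len), dtype=np.int32)
    train_y = np.zeros(n_rows, dtype=np.int32)

    row = 0
    for ids in encoded:
        n = len(ids) - 1              # number of prefixes for this sentence
        if n <= 0:
            continue                  # sentences with fewer than 2 words give no pairs
        # Repeat the first n indices on n rows, then keep the lower triangle:
        # row i keeps ids[0..i] and is zero to the right of the diagonal.
        train_x[row:row + n, :n] = np.tril(np.tile(ids[:n], (n, 1)))
        train_y[row:row + n] = ids[1:]
        row += n
    return train_x, train_y

# Sample data, assumed for illustration only.
sentences = [['the', 'pencil', 'is', 'red'],
             ['they', 'are', 'strange'],
             ['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]

# Dictionary lookup: each distinct word gets its first-seen position.
word_to_index = {}
for sentence in sentences:
    for word in sentence:
        word_to_index.setdefault(word, len(word_to_index))

max_sentence_len = max(len(s) for s in sentences)
train_x, train_y = build_training_arrays(sentences, word_to_index, max_sentence_len)
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)
```

This still loops over sentences in Python, but the two inner loops are replaced by array operations, and each word is looked up once per sentence. Whether that is fast enough for a given corpus would need to be timed.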

UPD: For better understanding, let's look at an example. Take the following sentences: "The pencil is red", "They are strange", "No legacy is so rich as honesty". Then my sentences will be:

[['the', 'pencil', 'is', 'red'], 
 ['they', 'are', 'strange'],
 ['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]

Next step (creating the matrix and array):

train_x --> train_y:
the --> pencil
the pencil --> is
the pencil is --> red
they --> are
they are --> strange
no --> legacy
no legacy --> is
no legacy is --> so
no legacy is so --> rich
no legacy is so rich --> as
no legacy is so rich as --> honesty

So, to summarize, train_x will be:

max_sentence_len = 7
[
 [the 0 0 0 0 0 0]
 [the pencil 0 0 0 0 0]
 [the pencil is 0 0 0 0]
 [they 0 0 0 0 0 0]
 [they are 0 0 0 0 0]
 [no 0 0 0 0 0 0]
 [no legacy 0 0 0 0 0]
 [no legacy is 0 0 0 0]
 [no legacy is so 0 0 0]
 [no legacy is so rich 0 0]
 [no legacy is so rich as 0]
]

and train_y:

[pencil, is, red, are, strange, legacy, is, so, rich, as, honesty]

Of course, the matrix and array do not hold the words themselves: they hold each word's index in the vocabulary (word2idx). Creating the vocabulary can be simple, for example:

vocab = []
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab.append(word)

word2idx is just as simple: it returns the position of a word in vocab.
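The definition of word2idx is not shown above, so the following is only a sketch of what such a lookup might look like, assuming it returns a word's first-seen position in the vocabulary. It keeps the vocabulary in a dictionary, which avoids the linear scan that "word not in vocab" performs on a list. The helper name build_vocab is hypothetical.

```python
def build_vocab(sentences):
    """Hypothetical helper: map each distinct word to its first-seen position."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:          # average O(1) for a dict, O(len) for a list
                vocab[word] = len(vocab)
    return vocab

sentences = [['the', 'pencil', 'is', 'red'],
             ['they', 'are', 'strange'],
             ['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]

vocab = build_vocab(sentences)

def word2idx(word):
    # Assumed behaviour: return the word's index in the vocabulary.
    return vocab[word]

print(word2idx('pencil'), len(vocab))
```

With this numbering the first word ('the') gets index 0, which is also the padding value in train_x. Starting the indices at 1 would keep the two apart, if that distinction matters for the model.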

0 answers: