I have a list of sentences, where each sentence is already split into words.
I mean, `sentences` looks like this:
[['word0', 'word1'], ['word2', 'word3', 'word4', 'word5'],
['word6', 'word7', 'word8'], ....]
Each sentence has its own length, so I found `max_sentence_len` like this:
max_sentence_len = max(max_sentence_len, len(current_sentence))
I want to build a matrix and an array. Let's take this sentence:
['word2', 'word3', 'word4', 'word5']
So, in short, the matrix rows (left) and the array entries (right) will be:
'word2' ---> 'word3'
'word2' 'word3' ---> 'word4'
'word2' 'word3' 'word4' ---> 'word5'
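The mapping above can be sketched in a few lines of plain Python. This is only an illustration of the idea; `prefix_pairs` is a hypothetical helper name that does not appear in my real code.

```python
def prefix_pairs(sentence):
    """Return (prefix, next_word) pairs for one tokenized sentence."""
    # A sentence of n words yields n - 1 pairs; a one-word sentence yields none.
    return [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]


pairs = prefix_pairs(['word2', 'word3', 'word4', 'word5'])
for prefix, target in pairs:
    print(' '.join(prefix), '--->', target)
```

Each pair becomes one row of the matrix (the prefix) and one entry of the array (the target).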
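Written out element by element:
matrix[0, 0] = 'word2' ---> array[0] = 'word3'
matrix[1, 0] = 'word2', matrix[1, 1] = 'word3' ---> array[1] = 'word4'
....
And this has to be done for all sentences!
First I count how many rows the matrix will have, then I build the matrix and the array as I explained: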
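import numpy as np
# One row per (prefix, next word) pair: a sentence of n words gives n - 1 rows.
summ = 0
for line in sentences:
    summ += len(line) - 1
train_x = np.zeros([summ, max_sentence_len], dtype=np.int32)
train_y = np.zeros([summ], dtype=np.int32)
ind = 0  # index of the row being filled
for sentence in sentences:
    for i in range(len(sentence) - 1):
        # Row `ind` holds the indices of the first i + 1 words of the sentence...
        for j in range(i + 1):
            train_x[ind, j] = word2idx(sentence[j])
        # ...and its target is the word that follows them.
        train_y[ind] = word2idx(sentence[i + 1])
        ind += 1
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)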
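As a side note, the two bookkeeping values used above (`max_sentence_len` and `summ`) can each be computed in one pass with a generator expression. A minimal sketch, using the placeholder sentences from the top of the question:

```python
sentences = [['word0', 'word1'],
             ['word2', 'word3', 'word4', 'word5'],
             ['word6', 'word7', 'word8']]

# Length of the longest sentence, in words.
max_sentence_len = max(len(s) for s in sentences)

# Number of matrix rows: a sentence of n words contributes n - 1 rows.
summ = sum(len(s) - 1 for s in sentences)

print(max_sentence_len, summ)  # prints: 4 6
```

This only tidies up the counting; it is not where the time goes.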
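`word2idx` just returns the index of a word in the vocabulary.
It works well! But it takes too long (for example, when `summ` is 630000 or more).
Is there any way to do this faster?
UPD:
For better understanding, let's look at an example case. Let's use the following sentences: "the pencil is red", "they are strange", "no legacy is so rich as honesty". Then my `sentences` will be:
[['the', 'pencil', 'is', 'red'],
['they', 'are', 'strange'],
['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]
Next step (creating the matrix and the array). In summary, train_x --> train_y is: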
the --> pencil
the pencil --> is
the pencil is --> red
they --> are
they are --> strange
no --> legacy
no legacy --> is
no legacy is --> so
no legacy is so --> rich
no legacy is so rich --> as
no legacy is so rich as --> honesty
With max_sentence_len = 7, `train_x` will be:
[ [the 0 0 0 0 0 0]
[the pencil 0 0 0 0 0]
[the pencil is 0 0 0 0]
[they 0 0 0 0 0 0]
[they are 0 0 0 0 0]
[no 0 0 0 0 0 0]
[no legacy 0 0 0 0 0]
[no legacy is 0 0 0 0]
[no legacy is so 0 0 0]
[no legacy is so rich 0 0]
[no legacy is so rich as 0] ]
and `train_y` will be:
[pencil, is, red, are, strange, legacy, is, so, rich, as, honesty]
Of course, the matrix and the array do not hold the words themselves; they hold the words' indices in the vocabulary (from `word2idx`).
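One direction that might be faster, sketched here under several assumptions: each word is looked up only once per sentence, and the inner two loops are replaced by a lower-triangular fill with `np.tril`. The names `build_training_arrays` and `word_to_index` are hypothetical, and the indices are assumed to start at 1 so that 0 stays free for padding (the question does not say how padding is distinguished from a real word). I have not benchmarked this.

```python
import numpy as np


def build_training_arrays(sentences, word_to_index, max_sentence_len):
    """Build (train_x, train_y) with one row per (prefix, next word) pair."""
    n_rows = sum(max(len(s) - 1, 0) for s in sentences)
    train_x = np.zeros((n_rows, max_sentence_len), dtype=np.int32)
    train_y = np.zeros(n_rows, dtype=np.int32)
    row = 0
    for sentence in sentences:
        n = len(sentence) - 1  # number of rows this sentence contributes
        if n < 1:
            continue
        # Look every word up exactly once.
        idx = np.array([word_to_index[w] for w in sentence], dtype=np.int32)
        # Repeat the first n indices n times, then zero everything above the
        # diagonal: row i keeps idx[0..i] and is zero-padded after that.
        train_x[row:row + n, :n] = np.tril(np.tile(idx[:-1], (n, 1)))
        train_y[row:row + n] = idx[1:]
        row += n
    return train_x, train_y


sentences = [['the', 'pencil', 'is', 'red'],
             ['they', 'are', 'strange'],
             ['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]

# Assumed vocabulary mapping: indices start at 1, 0 is reserved for padding.
word_to_index = {}
for sentence in sentences:
    for word in sentence:
        word_to_index.setdefault(word, len(word_to_index) + 1)

max_sentence_len = max(len(s) for s in sentences)
train_x, train_y = build_training_arrays(sentences, word_to_index, max_sentence_len)
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)
```

The temporary array built by `np.tile` is n by n per sentence, which should be fine for short sentences but could be wasteful for very long ones.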
Creating the vocabulary can be simple, for example:
vocab = []
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab.append(word)
and `word2idx`, which returns a word's index in the vocabulary, can be just as simple.
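For illustration, here is one way the vocabulary and `word2idx` could be written so that both the membership test and the lookup use a dict instead of scanning a list. This is a sketch, not my actual implementation; `word_index` is a hypothetical helper name.

```python
sentences = [['the', 'pencil', 'is', 'red'],
             ['they', 'are', 'strange'],
             ['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]

vocab = []       # index -> word, in order of first appearance
word_index = {}  # word -> index (hypothetical helper, mirrors vocab)
for sentence in sentences:
    for word in sentence:
        if word not in word_index:  # dict membership test, no list scan
            word_index[word] = len(vocab)
            vocab.append(word)


def word2idx(word):
    """Return the position of `word` in `vocab`."""
    return word_index[word]


print(word2idx('pencil'))  # prints: 1
```

Note that with this numbering the first word gets index 0, the same value used for padding in `train_x`; shifting every index by one would keep 0 free if that matters.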