Splitting strings by words

Time: 2019-02-28 15:46:22

Tags: python python-3.x performance numpy

I have a list of sentences in which every sentence is already split into words. That is, sentences looks like this:

[['word0', 'word1'], ['word2', 'word3', 'word4', 'word5'], ['word6', 'word7', 'word8'], ....]

Each sentence has its own length, so I find max_sentence_len like this:

max_sentence_len = max(max_sentence_len, len(current_sentence))

I want to build a matrix and an array. Take this sentence:

['word2', 'word3', 'word4', 'word5']

For it, the matrix rows (left) and the array entries (right) will be:

'word2'  ---> 'word3'
'word2' 'word3'  ---> 'word4'
'word2' 'word3' 'word4' ---> 'word5'

This has to be done for every sentence!
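To make the prefix-to-next-word mapping concrete, here is a minimal plain-Python sketch. The helper name prefix_pairs is hypothetical and not part of the original code; it only enumerates the pairs and does no indexing or padding.

```python
def prefix_pairs(sentence):
    """Return (prefix, next_word) pairs for one tokenized sentence."""
    # A sentence of n words yields n - 1 pairs: the first i words
    # form the prefix and word i is the target that follows it.
    return [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]


pairs = prefix_pairs(['word2', 'word3', 'word4', 'word5'])
for prefix, target in pairs:
    print(prefix, '--->', target)
```

Each sentence contributes len(sentence) - 1 pairs, which is where the row count of the matrix comes from.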

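In terms of indices, for the example sentence:

matrix[0, 0] = 'word2' ---> array[0] = 'word3'
matrix[1, 0] = 'word2', matrix[1, 1] = 'word3' ---> array[1] = 'word4'
....

First, I count how many rows the matrix will have:

summ = 0
for line in sentences:
    summ += len(line)-1

Then I build the matrix and the array as I described:

import numpy as np
train_x = np.zeros([summ, max_sentence_len], dtype=np.int32)
train_y = np.zeros([summ], dtype=np.int32)
ind = 0
for sentence in sentences:
    for i in range(len(sentence)-1):
        # row ind holds the first i+1 words of the sentence
        for j in range(i+1):
            train_x[ind, j] = word2idx(sentence[j])
        # the target is the word that follows this prefix
        train_y[ind] = word2idx(sentence[i+1])
        ind += 1
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)

Here word2idx just returns the index of a word in the vocabulary.

This works well, but it takes too long (for example, when summ is more than 630000).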

Is there any way to do this faster?
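One possible direction, offered only as a sketch: the triple loop looks up the same word many times and writes one cell at a time. Encoding each sentence once and filling all of its rows with a lower-triangular block removes the two inner Python loops. The function build_training_data and the dict word_to_index are assumptions introduced here, not part of the original code, and reserving index 0 for padding is also an assumption. Whether this is fast enough for your data would need to be measured.

```python
import numpy as np


def build_training_data(sentences, word_to_index, max_sentence_len):
    """Build the padded prefix matrix and the next-word array.

    word_to_index is assumed to be a dict mapping each word to an int.
    """
    # Look up every word exactly once per occurrence.
    encoded = [np.array([word_to_index[w] for w in s], dtype=np.int32)
               for s in sentences]
    n_rows = sum(max(len(ids) - 1, 0) for ids in encoded)

    train_x = np.zeros((n_rows, max_sentence_len), dtype=np.int32)
    train_y = np.zeros(n_rows, dtype=np.int32)

    row = 0
    for ids in encoded:
        n = len(ids) - 1          # rows contributed by this sentence
        if n <= 0:
            continue
        # Repeat the first n ids on n rows, then keep the lower triangle:
        # row i keeps ids[0..i] and is zero elsewhere.
        block = np.tril(np.tile(ids[:n], (n, 1)))
        train_x[row:row + n, :n] = block
        train_y[row:row + n] = ids[1:]
        row += n
    return train_x, train_y


# Usage with made-up data; indices start at 1 so that 0 means padding.
sentences = [['the', 'pencil', 'is', 'red'], ['they', 'are', 'strange']]
word_to_index = {}
for s in sentences:
    for w in s:
        word_to_index.setdefault(w, len(word_to_index) + 1)

train_x, train_y = build_training_data(sentences, word_to_index, 4)
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)
```

The remaining loop runs once per sentence rather than once per matrix cell, so most of the work moves into NumPy.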

UPD: For better understanding, let's look at an example. Take these sentences: "The pencil is red", "They are strange", "No legacy is so rich as honesty". Then my sentences will be:

[['the', 'pencil', 'is', 'red'], 
 ['they', 'are', 'strange'],
 ['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]

Next step (creating the matrix and the array):

train_x --> train_y:
the --> pencil
the pencil --> is
the pencil is --> red
they --> are
they are --> strange
no --> legacy
no legacy --> is
no legacy is --> so
no legacy is so --> rich
no legacy is so rich --> as
no legacy is so rich as --> honesty

So, in the end, train_x will be:

max_sentence_len = 7
[
[the 0 0 0 0 0 0]
[the pencil 0 0 0 0 0]
[the pencil is 0 0 0 0]
[they 0 0 0 0 0 0]
[they are 0 0 0 0 0]
[no 0 0 0 0 0 0]
[no legacy 0 0 0 0 0]
[no legacy is 0 0 0 0]
[no legacy is so 0 0 0]
[no legacy is so rich 0 0]
[no legacy is so rich as 0]
]

and train_y:

[pencil, is, red, are, strange, legacy, is, so, rich, as, honesty]

Of course, the entries of the matrix and the array are not the words themselves; they are the words' indices in the vocabulary (word2idx). Building the vocabulary can be very simple, for example:

vocab = []
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab.append(word)

and word2idx can be just as simple.
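The definition of word2idx does not appear in the post, so the following is only a guess at one reasonable shape for it. With a list, both "word not in vocab" and a list-based index lookup scan the whole list, which adds up when word2idx is called inside nested loops. A dict keeps the same first-seen ordering with constant-time lookups on average. The names build_vocab and word_to_index are hypothetical.

```python
def build_vocab(sentences):
    """Map each distinct word to its first-seen position (0, 1, 2, ...)."""
    word_to_index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)
    return word_to_index


sentences = [['the', 'pencil', 'is', 'red'],
             ['they', 'are', 'strange'],
             ['no', 'legacy', 'is', 'so', 'rich', 'as', 'honesty']]
word_to_index = build_vocab(sentences)


def word2idx(word):
    # Hypothetical stand-in for the lookup used in the question.
    return word_to_index[word]


print(word2idx('pencil'))
```

Note that with this numbering the first word gets index 0, the same value the matrix uses for padding; shifting all indices by one avoids that clash if it matters for the model.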

0 answers:

No answers