Question

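I have code that imports a dataset of roughly 500,000 DNA strings, each 100 bp long. I am trying to one-hot encode the strings into 2D matrices. The code can encode a small subset of the dataset containing 999 strings, but it will not encode the full dataset. I am sure that all of the strings have the same shape.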

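I have made sure the dataset consists of uniform 100 bp sequences, no more and no less. I have also removed any sequences that do not contain "A", "G", "C", or "T". The code works on a small subset of the data, but it will not encode the full dataset even though the shapes are the same.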

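import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder(categories='auto')
input_features = []

# `sequences` is the list of 100 bp DNA strings loaded earlier
for sequence in sequences:
  # Map each letter of this sequence to an integer label
  integer_encoded = integer_encoder.fit_transform(list(sequence))
  # Reshape to a column vector, as OneHotEncoder expects 2D input
  integer_encoded = np.array(integer_encoded).reshape(-1, 1)
  # One-hot encode the labels (both encoders are refit on every sequence)
  one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
  input_features.append(one_hot_encoded.toarray())

np.set_printoptions(threshold=40)
# Stack the per-sequence matrices into a single 3D array
input_features = np.stack(input_features)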

With the small dataset of 999 strings, the code gives me the following output:

One hot encoding of Sequence #1:
 [[1. 1. 0. ... 1. 0. 1.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]]

However, when I try the full dataset, it raises this error:

ValueError: all input arrays must have the same shape
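
`np.stack` raises this error when at least one array in the list has a different shape from the others, so one way to narrow it down is to check the assumptions directly on the raw strings. The following is a minimal sketch, not a definitive diagnosis: the helper name `summarize` and the three example strings are hypothetical stand-ins for the real data. It counts the sequence lengths and lists the indices of sequences whose set of characters is not exactly A, C, G, and T.

```python
from collections import Counter

def summarize(sequences, alphabet="ACGT"):
    """Return a Counter of sequence lengths and the indices of sequences
    whose set of characters differs from the expected alphabet."""
    lengths = Counter(len(s) for s in sequences)
    odd = [i for i, s in enumerate(sequences) if set(s) != set(alphabet)]
    return lengths, odd

# Hypothetical stand-in data: index 1 has no G or T, index 2 contains an N
sequences = ["ACGT", "AACC", "ACGN"]
lengths, odd = summarize(sequences)
print(lengths)
print(odd)
```

If `lengths` has more than one key, or `odd` is not empty, then the per-sequence matrices will not all come out with the same shape.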

The code works on the small dataset but not on the larger dataset, even though the sequences have the same shape.
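
One possible cause, offered as a guess since the full dataset is not shown: both encoders are refit on every sequence, so the number of one-hot columns depends on how many distinct letters that particular sequence contains. A 100 bp sequence that happens to lack one base would produce a 100 x 3 matrix instead of 100 x 4, and `np.stack` would then fail. The sketch below avoids this by encoding against a fixed alphabet with plain NumPy instead of the scikit-learn encoders. The names `BASES`, `LOOKUP`, and `one_hot_encode`, and the two example strings, are assumptions for illustration.

```python
import numpy as np

BASES = "ACGT"  # fixed alphabet, assumed to cover every sequence

# Lookup table from byte value to column index; -1 marks unexpected characters
LOOKUP = np.full(256, -1, dtype=np.int64)
for i, base in enumerate(BASES):
    LOOKUP[ord(base)] = i

def one_hot_encode(sequence):
    """Return a (len(sequence), 4) one-hot matrix with columns ordered A, C, G, T."""
    codes = LOOKUP[np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)]
    if (codes < 0).any():
        raise ValueError(f"unexpected character in sequence: {sequence!r}")
    # Row i of the identity matrix is the one-hot vector for column index i
    return np.eye(len(BASES), dtype=np.float32)[codes]

# Hypothetical stand-in data: the second sequence contains no T at all
sequences = ["ACGTACGT", "AACCGGAA"]
input_features = np.stack([one_hot_encode(s) for s in sequences])
print(input_features.shape)
```

Because the column layout no longer depends on the contents of each sequence, every matrix has four columns and stacking works. Sequences with other characters (for example N) raise an error rather than silently changing the shape.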

0 answers: