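Suppose I have a data frame with two fields, say "Ticket" and "Category". Both are text input, but I have converted "Category" to integers. Now I want to split the data into training and test sets and upload it to SageMaker to train a model: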
X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])
Next I want to perform TF-IDF feature extraction to turn the text into numeric values, so I do this:
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Ticket'])  # fit on the text column that is transformed below, not the integer-encoded 'Category'
xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)
Here are the details of the types:
type(xtrain_tfidf)
# scipy.sparse.csr.csr_matrix
y_train.dtype
# dtype('float32')
Now I am trying to upload it to SageMaker with this code:
buf = io.BytesIO()
smac.write_spmatrix_to_sparse_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)
I get this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-239-53bd10157df7> in <module>()
1 buf = io.BytesIO()
----> 2 smac.write_spmatrix_to_sparse_tensor(buf, xtrain_tfidf, y_train)
3 buf.seek(0)
4
5
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_spmatrix_to_sparse_tensor(file, array, labels)
143 # Write labels
144 if labels is not None:
--> 145 _write_label_tensor(resolved_label_type, record, labels[row_idx])
146
147 # Write shape
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
621 key = com._apply_if_callable(key, self)
622 try:
--> 623 result = self.index.get_value(self, key)
624
625 if not is_scalar(result):
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2558 try:
2559 return self._engine.get_value(s, k,
-> 2560 tz=getattr(series.dtype, 'tz', None))
2561 except KeyError as e1:
2562 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 1
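One possible reading of the traceback above: `labels[row_idx]` is evaluated on a pandas Series, where an integer key is looked up as an index label rather than a position. After `train_test_split` the Series keeps the shuffled original row labels, so label 1 may simply be absent. The sketch below reproduces that behaviour with pandas alone. The data and the helper `label_at` are hypothetical stand-ins, not the SDK's code.

```python
import pandas as pd

# Hypothetical labels: the index mimics what a shuffled split leaves behind
# (original row labels 0, 7, 3; label 1 is missing).
y_train = pd.Series([2.0, 0.0, 1.0], index=[0, 7, 3], dtype="float32")

def label_at(labels, row_idx):
    # Stand-in for the labels[row_idx] access shown in the traceback.
    return labels[row_idx]

# Row 0 happens to exist as a label, so the first lookup works.
first = label_at(y_train, 0)

# Row 1 does not exist as a label, so the lookup raises KeyError.
try:
    label_at(y_train, 1)
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A plain NumPy array is indexed by position, so the same access succeeds.
y_train_array = y_train.values
second = label_at(y_train_array, 1)
```

If this is indeed the cause, passing `y_train.values` (or a Series whose index was reset) as the labels argument should avoid the error. I have not verified this against the SDK version in the traceback.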
When I tried to debug it, I copied the source code from its Git repository, https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/common.py#L113
and found that this is the method where the code breaks:
def _write_recordio(f, data):
    """Writes a single data point as a RecordIO record to the given file."""
    length = len(data)
    f.write(struct.pack('I', _kmagic))
    f.write(struct.pack('I', length))
    pad = (((length + 3) >> 2) << 2) - length
    f.write(data)
    print(padding)
    print(pad)
    print(padding[1])
    f.write(padding[pad])
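This is the output of the print statements I added. It looks like there are some null values in the text, but I cannot identify them. Please help.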
{0: b'', 1: b'\x00', 2: b'\x00\x00', 3: b'\x00\x00\x00'}
1
b'\x00'
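For what it is worth, the printed values look like alignment padding rather than nulls in the text: `pad` is the number of zero bytes needed to round the record length up to a multiple of 4. Here is a minimal sketch of that arithmetic, modelled on the method above. `KMAGIC` is my assumption for the value of `_kmagic`, and `PADDING` rebuilds the dictionary shown in the output.

```python
import io
import struct

KMAGIC = 0xCED7230A  # assumed value of the SDK's _kmagic constant
PADDING = {n: b"\x00" * n for n in range(4)}  # same shape as the printed dict

def write_recordio(f, data):
    """Write one record: magic, length, payload, then zero bytes up to a 4-byte boundary."""
    length = len(data)
    f.write(struct.pack("I", KMAGIC))
    f.write(struct.pack("I", length))
    # Round length up to the next multiple of 4 and take the difference.
    pad = (((length + 3) >> 2) << 2) - length
    f.write(data)
    f.write(PADDING[pad])
    return pad

buf = io.BytesIO()
pad_for_7 = write_recordio(buf, b"abcdefg")  # 7-byte payload needs 1 pad byte
total_bytes = len(buf.getvalue())            # 4 + 4 + 7 + 1

pad_for_8 = write_recordio(io.BytesIO(), b"abcdefgh")  # already aligned
```

A `pad` of 1 with `padding[1] == b'\x00'` therefore only says the serialized record was one byte short of a 4-byte boundary. It says nothing about missing values in the "Ticket" text.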