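I am working on a classification problem where my class labels are a list of strings, and I want to convert them into a tensor. So far I have tried converting the list of strings to a NumPy array with np.array and passing that to torch.from_numpy: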
truth = torch.from_numpy(np.array(truths))
But I get the following error:
RuntimeError: can't convert a given np.ndarray to a tensor - it has an invalid type. The only supported types are: double, float, int64, int32, and uint8.
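Can anyone suggest another approach? Thanks.
Answer 0 (score: 2)
If you don't want to use sklearn, another solution is to keep the original list and create an additional tensor of indices, which you can use later to look up the original values. I needed this in particular when I had to keep track of the original strings while tokenizing them in batches.
Example below: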
import torch
from torch.utils.data import DataLoader, TensorDataset

labels = ['cat', 'dog', 'mouse']
# one integer index per label: tensor([0, 1, 2])
torch_idx = torch.arange(len(labels))
# do whatever you like in torch, e.g. pass it to a DataLoader
dataset = TensorDataset(torch_idx)
loader = DataLoader(dataset, batch_size=1, shuffle=True)
for batch in loader:
    print(batch[0])
    print(labels[int(batch[0].item())])
# example output (the order varies between runs because shuffle=True):
# tensor([0])
# cat
# tensor([1])
# dog
# tensor([2])
# mouse
For my specific use case, the code looks like this:
input_ids, attention_masks, labels = tokenize_sentences(tokenizer, sentences, labels, max_length)
# create an index tensor to keep track of each sentence's original position
torch_idx = torch.arange(len(sentences))
dataset = TensorDataset(input_ids, attention_masks, labels, torch_idx)
loader = DataLoader(dataset, batch_size=1, shuffle=True)
for batch in loader:
    _, logit = model(batch[0],
                     token_type_ids=None,
                     attention_mask=batch[1],
                     labels=batch[2])
    pred_flat = torch.argmax(logit.detach(), dim=1).flatten()
    print(pred_flat)
    print(batch[2])
    # with batch_size=1 this compares a single prediction with a single label
    if pred_flat == batch[2]:
        print("\nThe following sentence was predicted correctly:")
        print(sentences[int(batch[3].item())])
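The index-tracking idea above keeps the strings outside the tensor. When the same class name appears many times, a related sketch (not part of the original answer) is to map each distinct string to an integer id with a plain dictionary. The `truths` list below is made-up sample data, and sorting the class names is just one assumed way to make the ids reproducible.

```python
# Hypothetical label list; repeated class names are expected in real data.
truths = ["cat", "dog", "mouse", "dog", "cat"]

# Sort the distinct labels so the mapping is the same on every run.
classes = sorted(set(truths))
label_to_idx = {label: idx for idx, label in enumerate(classes)}

# Integer ids, one per original label, in the original order.
encoded = [label_to_idx[label] for label in truths]

# Reverse lookup: an id is simply a position in `classes`.
decoded = [classes[idx] for idx in encoded]

print(encoded)
print(decoded)
```

The `encoded` list contains only integers, so it can be handed to torch.tensor(encoded, dtype=torch.long) without triggering the dtype error from the question.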
Answer 1 (score: 1)
# works only if every label is a numeric string such as '3' or '1.5'
truth = [float(x) for x in truths]
truth = np.asarray(truth)
truth = torch.from_numpy(truth)
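Answer 2 (score: 1)
Unfortunately, you can't do that right now. I also don't think it would be a good idea, because it would make PyTorch clumsy. A popular workaround is to convert the labels to a numeric type with sklearn.
Here is a short example: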
from sklearn import preprocessing
import torch
labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = le.fit_transform(labels)
# classes are sorted alphabetically, so 'elephant' gets 2 and 'mouse' gets 3
# targets: array([0, 1, 3, 2, 4])
targets = torch.as_tensor(targets)
# targets: tensor([0, 1, 3, 2, 4])
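Since you will probably need to convert between the true labels and the transformed labels, it is a good idea to keep the variable le around.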
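As a rough sketch of why keeping the encoder is useful, the snippet below maps class ids back to their string names with LabelEncoder.inverse_transform. The `predicted` tensor is a made-up stand-in for model output, not something from the original answer.

```python
import torch
from sklearn import preprocessing

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = torch.as_tensor(le.fit_transform(labels))

# Hypothetical class ids, standing in for predictions from a model.
predicted = torch.tensor([3, 0, 2])

# inverse_transform expects array-like input, so convert the tensor to NumPy first.
names = le.inverse_transform(predicted.cpu().numpy())
print(list(names))
```

The fitted classes are also available in sorted order as le.classes_, which can be indexed directly if only a few lookups are needed.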
Answer 3 (score: 1)
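The trick is to first find the maximum encoded length (in bytes) of the words in the list, and then, in a second loop, copy each word into a zero-padded tensor. Note that UTF-8 is a variable-width encoding: ASCII characters take one byte each, while the Hebrew letters in this example take two bytes each.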
In[]
import torch
words = ['שלום', 'beautiful', 'world']
max_l = 0
ts_list = []
for w in words:
    ts_list.append(torch.ByteTensor(list(bytes(w, 'utf8'))))
    max_l = max(ts_list[-1].size()[0], max_l)
w_t = torch.zeros((len(ts_list), max_l), dtype=torch.uint8)
for i, ts in enumerate(ts_list):
    w_t[i, 0:ts.size()[0]] = ts
w_t
Out[]
tensor([[215, 169, 215, 156, 215, 149, 215, 157, 0],
[ 98, 101, 97, 117, 116, 105, 102, 117, 108],
[119, 111, 114, 108, 100, 0, 0, 0, 0]], dtype=torch.uint8)
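To get the strings back out of such a padded byte tensor, one possible sketch is to strip the trailing zero bytes from each row and decode the rest as UTF-8. This assumes that none of the original strings ends with a NUL character, since such a character would be indistinguishable from padding.

```python
import torch

# Same padded byte tensor as in the output above.
w_t = torch.tensor([[215, 169, 215, 156, 215, 149, 215, 157, 0],
                    [98, 101, 97, 117, 116, 105, 102, 117, 108],
                    [119, 111, 114, 108, 100, 0, 0, 0, 0]],
                   dtype=torch.uint8)

decoded = []
for row in w_t:
    raw = bytes(row.tolist())  # list of ints in 0..255 -> bytes object
    # Drop the zero padding on the right, then decode the remaining bytes.
    decoded.append(raw.rstrip(b"\x00").decode("utf8"))

print(decoded)
```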