What is the equivalent of OrdinalEncoder in TensorFlow?

Asked: 2020-01-02 16:18:49

Tags: tensorflow encode tensorflow-datasets

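My dataset has one particular feature that contains categorical strings. The values belong to ['a', 'ae', 'e', 'i', 'u'].

However, I want to map these strings to numbers. Note that I am using TensorFlow datasets (tf.data).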
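As an illustration of the mapping being asked for, here is a minimal plain-Python sketch. It assumes the scikit-learn convention of sorting the categories seen in the data and assigning each one its index. The helper name `fit_ordinal` is hypothetical and is not part of any library.

```python
def fit_ordinal(values):
    """Build a category -> integer mapping from the distinct values, in sorted order."""
    categories = sorted(set(values))
    return {category: index for index, category in enumerate(categories)}


# Labels as they might come out of the dataset, in arbitrary order.
mapping = fit_ordinal(['ae', 'u', 'a', 'e', 'i', 'ae'])
codes = [mapping[v] for v in ['ae', 'u', 'a']]

print(mapping)
print(codes)
```

The open question is how to get the same behaviour inside a tf.data pipeline, where the labels are string tensors rather than Python strings.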

Here is my sample code:

import os

import tensorflow as tf

data_dir = "C:/Users/user/Documents/vowels/"

# I have data collected from 13 different subjects. Each time the data is recorded is considered one trial. In total we have 6 trials per subject.
# In this case, I used the first 5 trials for training and the 6th for testing/validation.
subjects_nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

trial_nums_train = [1, 2, 3, 4, 5]
trial_nums_test = [6]

paths_train = [data_dir + 'Col3/*/*_{}_trail_{}.png'.format(i, j) for i in subjects_nums for j in trial_nums_train]
paths_test = [data_dir + 'Col3/*/*_{}_trail_{}.png'.format(i, j) for i in subjects_nums for j in trial_nums_test]

list_ds_train = tf.data.Dataset.list_files(paths_train)
list_ds_test = tf.data.Dataset.list_files(paths_test)

# Here I did the conversion manually, on purpose for now. But what if I don't know all the categories, or there are dozens of them? I would like to convert the strings into numbers automatically.
def get_label(file_path):
    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # The second to last is the class-directory
    char = tf.strings.split(parts[-2], "_")[1]

    tensor = char
    if tensor == 'a':
        return 0
    elif tensor == 'ae':
        return 1
    elif tensor == 'e':
        return 2
    elif tensor == 'i':
        return 3
    else:
        return 4

def decode_img(img):
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)
    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    img = tf.image.convert_image_dtype(img, tf.float32)
    # No resizing is applied here; the image keeps its original size.
    return img


def process_path(file_path):
    label = get_label(file_path)
    # load the raw data from the file as a string
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label

# Use Dataset.map to create a dataset of image, label pairs:
# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
AUTOTUNE = tf.data.experimental.AUTOTUNE
labeled_ds_train = list_ds_train.map(process_path, num_parallel_calls=AUTOTUNE)
labeled_ds_test = list_ds_test.map(process_path, num_parallel_calls=AUTOTUNE)

labeled_ds_train = labeled_ds_train.cache().shuffle(buffer_size=1000).batch(32).prefetch(AUTOTUNE)
labeled_ds_test = labeled_ds_test.cache().batch(32).prefetch(AUTOTUNE)

Then I check what the dataset contains:

for image, label in labeled_ds_train.take(1):
    print("Image shape: ", image.numpy().shape)
    print("Label: ", label.numpy())

I get:

Image shape:  (32, 130, 267, 3)
Label:  [b'ae' b'u' b'a' b'e' b'i' b'ae' b'i' b'e' b'e' b'a' b'i' b'a' b'i' b'a' b'i' b'u' b'u' b'ae' b'e' b'a' b'e' b'ae' b'a' b'i' b'i' b'e' b'ae' b'i' b'i' b'e' b'e' b'i']

I would like a simple way to convert the strings into numbers on the fly, or automatically.

How can this be done?
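One possible direction, given here as a sketch under assumptions rather than a confirmed answer, is a static lookup table built with `tf.lookup.StaticHashTable`. The vocabulary is hard-coded and sorted, so the integer codes follow alphabetical order. The function name `encode_label` is made up for illustration, and unknown strings are mapped to -1 by choice.

```python
import tensorflow as tf

# Assumed vocabulary; sorted so that the codes are reproducible.
categories = sorted(['a', 'ae', 'e', 'i', 'u'])

# Static string -> int64 lookup table; strings outside the vocabulary map to -1.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(categories),
        values=tf.range(len(categories), dtype=tf.int64),
    ),
    default_value=-1,
)


def encode_label(char):
    """Map a string tensor (scalar or batched) to its integer code."""
    return table.lookup(char)


# Small stand-in dataset of string labels, encoded inside Dataset.map.
labels = tf.data.Dataset.from_tensor_slices(['ae', 'u', 'a', 'e', 'i'])
encoded = [int(x) for x in labels.map(encode_label).as_numpy_iterator()]
print(encoded)
```

In the pipeline above, the same lookup could in principle be applied to `char` inside `get_label` in place of the if/elif chain, provided the table is created once outside the mapped function.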

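Also, for context: I have a root folder named vowels, which contains two subfolders named Col3 and Col4. Each of them contains the subfolders vowel_a, vowel_ae, vowel_e, vowel_i, and vowel_u, and the images are stored in those subfolders. The images are named subject_{}_trial_{}.png, where the first placeholder is the subject number and the second is the trial number.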
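Given this layout, the list of categories would not have to be typed by hand; it could be read from the class-folder names. The sketch below assumes every class folder is named `vowel_<name>` and uses a throwaway temporary directory in place of the real data. The helper `collect_categories` is hypothetical.

```python
import tempfile
from pathlib import Path


def collect_categories(col_dir):
    """Return the sorted class names taken from sub-folders named 'vowel_<name>'."""
    names = set()
    for sub in Path(col_dir).iterdir():
        if sub.is_dir() and sub.name.startswith("vowel_"):
            # Keep everything after the first underscore, e.g. 'vowel_ae' -> 'ae'.
            names.add(sub.name.split("_", 1)[1])
    return sorted(names)


# Demo on a temporary directory that mimics the folder structure described above.
with tempfile.TemporaryDirectory() as root:
    for name in ["vowel_a", "vowel_ae", "vowel_e", "vowel_i", "vowel_u"]:
        (Path(root) / "Col3" / name).mkdir(parents=True)
    categories = collect_categories(Path(root) / "Col3")

print(categories)
```

The resulting list can then serve as the vocabulary for whatever string-to-integer mapping is used in the pipeline.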

Thank you very much for your help.

0 answers:

No answers yet.