LSTM project incompatible with CSV format

Time: 2017-07-10 07:45:42

Tags: python python-3.x csv machine-learning tensorflow

I am trying to replicate Chevalier's LSTM Human Activity Recognition algorithm and ran into a problem when trying to use my own data in CSV format. The format used in the git repository is txt. My CSV data has the following format:

0.000995,8
0.020801,8
0.040977,8
0.060786,8
0.080970,8
...            ...

The original file can be found here. The x values (time) are in column 0 (-80.060003, etc.) and the y values (value) are in column 1 (8, 8, etc.). I tried to use pandas:

pandas.read_csv(DATASET_PATH + TRAIN + "data_train.csv", skiprows=1, header=None, sep=',', usecols=[0, 1])

but it does not seem to be compatible with the data format in the "Preparing dataset" section (and possibly other sections):

TRAIN = "train/"
TEST = "test/"

# Load "X" (the neural network's training and testing inputs)

def load_X(X_signals_paths):
    X_signals = []

    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()

    return np.transpose(np.array(X_signals), (1, 2, 0))

X_train_signals_paths = [
    DATASET_PATH + TRAIN + "Inertial Signals/" + signal + "train.txt" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + "Inertial Signals/" + signal + "test.txt" for signal in INPUT_SIGNAL_TYPES
]

X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)


# Load "y" (the neural network's training and testing outputs)

def load_y(y_path):
    file = open(y_path, 'r')
    # Read dataset from disk, dealing with text file's syntax
    y_ = np.array(
        [elem for elem in [
            row.replace('  ', ' ').strip().split(' ') for row in file
        ]], 
        dtype=np.int32
    )
    file.close()

    # Substract 1 to each output class for friendly 0-based indexing 
    return y_ - 1

y_train_path = DATASET_PATH + TRAIN + "y_train.txt"
y_test_path = DATASET_PATH + TEST + "y_test.txt"

y_train = load_y(y_train_path)
y_test = load_y(y_test_path)

This is my implementation through iPython3:

In [0]:

TRAIN = "train/"
TEST = "test/"

def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        file = pandas.read_csv(DATASET_PATH + TRAIN + "data_train.csv", skiprows=1, header=None, sep=',', usecols=[0])
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                str(row).replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )

    return np.transpose(np.array(X_signals), (1, 2, 0))

X_train_signals_paths = [
    DATASET_PATH + TRAIN + signal + "train.csv" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
    DATASET_PATH + TEST + signal + "test.csv" for signal in INPUT_SIGNAL_TYPES
]

X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
print(X_train, X_test)

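I am hoping to get some help with formatting my data correctly so that it works seamlessly with this algorithm. Please let me know if you have any questions.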

1 Answer:

Answer 0: (score: 0)

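The code in the traceback is different from the code you actually posted in your question: the working code operates on a bare file handle, not on a Pandas data frame.

For reference, here is the code from the project you mention, again: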

def load_X(X_signals_paths):
    X_signals = []

    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # ^ the error comes where you have file = pandas.read_csv(...)

        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()

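file is just an iterator which returns one raw line (a sequence of characters) at a time, terminated by a newline; on this input, stripping the newline and squeezing spaces makes sense. But your code has already opened, parsed, and reformatted the file's contents into a Pandas data frame, which contains no newlines or spaces, only parsed numbers. Maybe go back to the upstream code; or, if you want to change something there, work out how to ask about that specific change. There is nothing wrong with CSV as such.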
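To make the difference concrete, here is a minimal sketch of what the upstream parsing sees when it iterates over a file handle. The `io.StringIO` object and its sample contents are assumed stand-ins for a real txt file from the project. By contrast, iterating directly over a pandas DataFrame yields its column labels rather than its rows, so the same list comprehension never gets to see the data.

```python
import io

# Assumed stand-in for open(path, 'r'): two lines in a space-separated layout.
txt_file = io.StringIO("1.0  2.0 3.0\n4.0 5.0  6.0\n")

# Iterating over a text file handle yields raw, newline-terminated strings.
raw_rows = list(txt_file)

# The upstream parsing squeezes double spaces, strips the newline, then splits.
parsed = [
    [float(field) for field in row.replace('  ', ' ').strip().split(' ')]
    for row in raw_rows
]
```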

Python has a very capable csv module, so maybe just use that instead of manually parsing the individual fields of the CSV.

    for signal_type_path in X_signals_paths:
        with open(signal_type_path, 'r') as csvfile:
            reader = csv.reader(csvfile)
            X_signals.append([np.array(row[0:2], dtype=np.float32) for row in reader])
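A slightly fuller sketch of the same idea, using only the standard library so it can be tried in isolation. The helper name `read_signal_csv` is hypothetical, and the sketch assumes the file has no header row; if yours has one, skip it with `next(reader)` before the loop.

```python
import csv

def read_signal_csv(path):
    """Return the first two columns of every CSV row as floats."""
    # newline='' is what the csv module documentation recommends for open().
    with open(path, 'r', newline='') as csvfile:
        reader = csv.reader(csvfile)
        # Empty lines come back as empty lists; `if row` skips them.
        return [[float(value) for value in row[0:2]] for row in reader if row]
```

Wrapping each inner list in `np.array(..., dtype=np.float32)`, as the fragment above does, would give the per-row arrays that the rest of `load_X` works with.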

Or, as a minimal change, split on commas instead of spaces. (Your data doesn't look like it actually needs any whitespace removed.)
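A sketch of that minimal change, keeping the upstream line-by-line parsing and only swapping the separator. The `io.StringIO` object is an assumed stand-in for the opened CSV file, filled with two rows from the question's sample.

```python
import io

# Assumed stand-in for open(signal_type_path, 'r') on a CSV file.
csv_file = io.StringIO("0.000995,8\n0.020801,8\n")

# Same shape as the upstream comprehension, but splitting on ',' not ' '.
rows = [
    [float(field) for field in row.strip().split(',')]
    for row in csv_file
]
```

This only holds up for simple numeric files; quoted fields or embedded commas are the cases where the csv module earns its keep.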

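Also, tangentially, your code hard-codes the file it reads. It would probably be better to keep the DATASET_PATH and TRAIN parameters entirely in the calling code and have load_X simply accept a list of full file paths, which it already receives and never modifies anyway.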
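One way the calling side could look, as a rough sketch. The helper `build_signal_paths`, the `"data/"` root and the `data_test.csv` file name are all hypothetical; only `data_train.csv` appears in the question.

```python
# Hypothetical values; substitute the ones from your own setup.
DATASET_PATH = "data/"
TRAIN = "train/"
TEST = "test/"

def build_signal_paths(dataset_path, split, file_names):
    """Join the dataset root, the split folder and each file name."""
    return [dataset_path + split + name for name in file_names]

# The caller decides which files to read; load_X would just open each path.
X_train_signals_paths = build_signal_paths(DATASET_PATH, TRAIN, ["data_train.csv"])
X_test_signals_paths = build_signal_paths(DATASET_PATH, TEST, ["data_test.csv"])
```

With the paths assembled here, `load_X` can open `signal_type_path` directly inside its loop and needs no knowledge of the directory layout.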