Using simple models on Google Trends data to predict something does not work as expected

Time: 2019-08-29 11:40:05

Tags: python pandas tensorflow keras google-trends

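I am using Google Trends to develop a simple model that predicts the future trend of a set of search terms. I took inspiration from this blog post and tried to do the same for other search terms, hoping to find the best model for this kind of task.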


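The problem is that the predictions for other search terms are completely wrong. I only use terms with a regular pattern, although sometimes not as regular as the one in the blog example. Here is the code I adapted: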

import numpy as np
import pandas as pd
from datetime import date
from matplotlib import pyplot as plt
from keras.models import Sequential
from keras.layers import InputLayer, Reshape, Conv1D, MaxPool1D, Flatten, Dense, LSTM
from keras.callbacks import EarlyStopping, ModelCheckpoint
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()



def prepare_data(target, window_X, window_y):
    """ Data preprocessing for multistep forecast """
    X, y = [], []
    start_X = 0
    end_X = start_X + window_X
    start_y = end_X
    end_y = start_y + window_y
    for _ in range(len(target)):
        if end_y <= len(target):  # <= keeps the last complete window
            X.append(target[start_X:end_X])
            y.append(target[start_y:end_y])
        start_X += 1
        end_X = start_X + window_X
        start_y += 1
        end_y = start_y + window_y
    return np.array(X), np.array(y)


def fit_model(type, X_train, y_train, X_test, y_test, batch_size, epochs):
    """ Training function for network """
    # Model input
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1], )))

    if type == 'mlp':
        model.add(Reshape(target_shape=(X_train.shape[1], )))
        model.add(Dense(units=64, activation='relu'))

    if type == 'cnn':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(Conv1D(filters=64, kernel_size=4, activation='relu'))
        model.add(MaxPool1D())
        model.add(Flatten())

    if type == 'lstm':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(LSTM(units=64, return_sequences=False))

    # Output layer
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=y_train.shape[1], activation='sigmoid'))

    # Compile
    model.compile(optimizer='adam', loss='mse')

    # Callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=10)
    model_checkpoint = ModelCheckpoint(filepath='model.h5', save_best_only=True)
    callbacks = [early_stopping, model_checkpoint]

    # Fit model
    model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
              batch_size=batch_size, epochs=epochs, callbacks=callbacks, verbose=2)

    # Load best model
    model.load_weights('model.h5')

    # Return
    return model


# Define windows
window_X = 12
window_y = 6

# Load data
data = pd.read_csv('data/holocaust-world.csv', sep=',')
data = data.set_index(keys=pd.to_datetime(data['month']), drop=True).drop('month', axis=1)

# Scale data
data['y'] = data['y'] / 100.

# Prepare tensors
X, y = prepare_data(target=data['y'].values, window_X=window_X, window_y=window_y)

# Training and test
train = 100
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]

# Train models
models = ['mlp', 'cnn', 'lstm']

# Test data
X_test = data['y'].values[-window_X:].reshape(1, -1)

# Predictions
preds = pd.DataFrame({'mlp': [np.nan]*6, 'cnn': [np.nan]*6, 'lstm': [np.nan]*6})
preds = preds.set_index(pd.date_range(start=date(2018, 11, 1), end=date(2019, 4, 1), freq='MS'))

# Fit models and plot
for mod in models:

    # Train models
    model = fit_model(type=mod, X_train=X_train, y_train=y_train, X_test=X_valid, y_test=y_valid, batch_size=16, epochs=1000)

    # Predict
    p = model.predict(x=X_test)

    # Fill
    preds[mod] = p[0]

# Plot the entire timeline, including the predicted segment
idx = pd.date_range(start=date(2004, 1, 1), end=date(2019, 4, 1), freq='MS')
data = data.reindex(idx)
plt.plot(data['y'], label='Google')

# Plot
plt.plot(preds['mlp'], label='MLP')
plt.plot(preds['cnn'], label='CNN')
plt.plot(preds['lstm'], label='LSTM')
plt.legend()
plt.show()

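Here I tried to evaluate interest in the Holocaust topic, which is also periodic (it peaks in January; you can get the CSV from the Google Trends website). Here are the results (image: Results).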


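So my questions are:

  • How can I make this model usable month by month (at the time of writing, through August 2019)? When I tried this, I got weird behavior, so for now I have manually deleted everything after October 2018 from the CSV.

  • How can I improve these simple models so that they actually give useful and meaningful results? I don't understand why the example in the blog post works perfectly while my attempts fail.

Thanks!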

1 answer:

Answer 0 (score: 1)

Increase the number of predictions you test on and you should get better results.

window_y = 49
...
# Predictions
preds = pd.DataFrame({'mlp': [np.nan]*49, 'cnn': [np.nan]*49, 'lstm': [np.nan]*49})
preds = preds.set_index(pd.date_range(start=date(2015, 1, 1), end=date(2019, 1, 1), freq='MS'))
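
The hardcoded start and end dates above have to be edited by hand every time the CSV gains a month. When the predictions are meant to follow the last observed month, as in the question, the index can instead be derived from the data. The following is a minimal sketch and not part of the original answer: `forecast_index` is a hypothetical helper, and it assumes the data is indexed by month-start timestamps.

```python
import pandas as pd


def forecast_index(observed_index, horizon):
    """Month-start dates for the `horizon` months after the last observation."""
    # MonthBegin(1) moves a month-start timestamp to the next month start.
    first = observed_index[-1] + pd.offsets.MonthBegin(1)
    return pd.date_range(start=first, periods=horizon, freq='MS')


# Stand-in for data.index: monthly observations through October 2018.
observed = pd.date_range(start='2004-01-01', end='2018-10-01', freq='MS')
idx = forecast_index(observed, horizon=6)
print(idx[0].date(), idx[-1].date())
```

With this, something like `preds.set_index(forecast_index(data.index, window_y))` would keep working when the CSV is extended to August 2019, provided `preds` has `window_y` rows.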

Playing with the train/test split also helps:

# Training and test
train = 50
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]
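
To see how the split interacts with the longer horizon, it can help to count how many windows the series actually yields. This is a rough sketch, not the answerer's code: `make_windows` is a hypothetical helper built on NumPy's `sliding_window_view`, and the series length of 178 is an assumption (monthly data from January 2004 through October 2018).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, window_X, window_y):
    """Return (X, y): window_X inputs followed by window_y targets."""
    # Each row of `span` is one contiguous stretch of inputs plus targets.
    span = sliding_window_view(series, window_X + window_y)
    return span[:, :window_X].copy(), span[:, window_X:].copy()


series = np.arange(178, dtype=float)  # assumed length, placeholder values
for window_y, train in [(6, 100), (49, 100), (49, 50)]:
    X, y = make_windows(series, window_X=12, window_y=window_y)
    print(f"window_y={window_y} train={train}: "
          f"{len(X)} samples, {len(X) - train} left for validation")
```

Under these assumptions a 49-step horizon leaves 118 windows, so `train = 100` would keep only 18 of them for validation, while `train = 50` keeps 68. That may be one reason the smaller training split behaves better here.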


However, although this particular trend is periodic, it is also influenced by other factors. Prophet can help you deal with this kind of trend better than simple machine learning models.
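
Before choosing a model, one rough way to gauge how much of a monthly series follows a yearly cycle is its autocorrelation at lag 12. This is only an illustrative sketch on synthetic data, not something the answer prescribes: `yearly_autocorr` is a hypothetical helper around pandas' `Series.autocorr`.

```python
import numpy as np
import pandas as pd


def yearly_autocorr(values, lag=12):
    """Correlation between a monthly series and itself shifted by `lag` months."""
    return pd.Series(values).autocorr(lag=lag)


months = np.arange(120)                      # ten years of monthly points
seasonal = np.sin(2 * np.pi * months / 12)   # perfectly periodic signal
noise = np.random.default_rng(0).normal(size=120)  # no seasonality at all

print("seasonal:", yearly_autocorr(seasonal))
print("noise:   ", yearly_autocorr(noise))
```

A value close to 1 suggests a strong yearly pattern. Lower values hint that other factors dominate, which is the situation where a simple windowed network is likely to struggle.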