Keras multi-class classification: incorrect predictions

Time: 2019-12-18 12:52:30

Tags: python machine-learning keras multiclass-classification

I am currently implementing a classifier that categorizes video transcripts into categories (TECH, TRAVEL, ...).

Here is the distribution of categories in the dataset:

[image: per-category sample counts]

There is not much data, and it is not well balanced.
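For reference, a quick way to inspect this distribution (a sketch reusing the same rows.csv and column names as in classifier.py below; value_counts is standard pandas):

import pandas as pd

df = pd.read_csv('./rows.csv')
df.columns = ['ID', 'CATEGORY', 'TRANSCRIPT']
# Number of transcripts per category, largest first
print(df['CATEGORY'].value_counts())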

classifier.py processes the data as follows:

import re
import pandas as pd
from numpy import asarray, zeros
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

df = pd.read_csv('./rows.csv')
df.columns = ['ID', 'CATEGORY', 'TRANSCRIPT']
df = df[pd.notnull(df['TRANSCRIPT'])]
df['CATEGORY_ID'] = df['CATEGORY'].factorize()[0]

print(df.shape)

category_id_df = df[['CATEGORY', 'CATEGORY_ID']].drop_duplicates()
id_to_category = dict(category_id_df[['CATEGORY_ID', 'CATEGORY']].values)

def preprocess_text(sen):
    # Keep letters only
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    # Drop single-character tokens (replace with a space so neighboring words don't merge)
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse repeated whitespace
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

X = []
sentences = list(df['TRANSCRIPT'])
for sen in sentences:
    X.append(preprocess_text(sen))

y = df['CATEGORY']

label_encoder = preprocessing.LabelEncoder()

y = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

# Map each GloVe token to its 100-dimensional vector
embeddings_dictionary = dict()

glove_file = open('./glove.6B.100d.txt', encoding='utf8')
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions

glove_file.close()

# Rows for words missing from GloVe remain all-zero
embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)
dense_layer_1 = Dense(len(id_to_category), activation='softmax')(LSTM_Layer_1)
model = Model(inputs=deep_inputs, outputs=dense_layer_1)
print(model.summary())

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1, validation_split=0.2)

score = model.evaluate(X_test, y_test, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])

# serialize model to JSON
model_json = model.to_json()
with open("./samples/categorization/model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("./samples/categorization/model.h5")
print("Saved model to disk")

The output is as follows:

(4294, 4)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 200)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 200, 100)          19104900  
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248    
_________________________________________________________________
dense_1 (Dense)              (None, 13)                1677      
=================================================================
Total params: 19,223,825
Trainable params: 118,925
Non-trainable params: 19,104,900
_________________________________________________________________
None
Train on 3091 samples, validate on 773 samples
Epoch 1/10
3091/3091 [==============================] - 14s 4ms/step - loss: 1.6889 - acc: 0.5341 - val_loss: 1.3790 - val_acc: 0.5627
Epoch 2/10
3091/3091 [==============================] - 12s 4ms/step - loss: 1.2055 - acc: 0.6299 - val_loss: 1.0348 - val_acc: 0.6740
Epoch 3/10
3091/3091 [==============================] - 12s 4ms/step - loss: 1.0078 - acc: 0.6930 - val_loss: 0.9385 - val_acc: 0.7115
Epoch 4/10
3091/3091 [==============================] - 12s 4ms/step - loss: 0.8797 - acc: 0.7302 - val_loss: 0.8864 - val_acc: 0.7348
Epoch 5/10
3091/3091 [==============================] - 12s 4ms/step - loss: 0.8152 - acc: 0.7473 - val_loss: 0.8713 - val_acc: 0.7439
Epoch 6/10
3091/3091 [==============================] - 12s 4ms/step - loss: 0.7962 - acc: 0.7557 - val_loss: 0.8217 - val_acc: 0.7516
Epoch 7/10
3091/3091 [==============================] - 13s 4ms/step - loss: 0.7462 - acc: 0.7567 - val_loss: 0.8110 - val_acc: 0.7529
Epoch 8/10
3091/3091 [==============================] - 12s 4ms/step - loss: 0.7363 - acc: 0.7619 - val_loss: 0.8692 - val_acc: 0.7400
Epoch 9/10
3091/3091 [==============================] - 13s 4ms/step - loss: 0.6828 - acc: 0.7849 - val_loss: 0.7748 - val_acc: 0.7581
Epoch 10/10
3091/3091 [==============================] - 13s 4ms/step - loss: 0.6379 - acc: 0.7881 - val_loss: 0.6749 - val_acc: 0.7840
430/430 [==============================] - 0s 1ms/step
Saved model to disk
Test Score: 0.6955041333686474
Test Accuracy: 0.7790697813034058

I also have another script, predictor.py, in which I load the previously saved model to make predictions on new transcripts.

import json
import re
from keras.models import model_from_json
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

with open('./model.json','r') as f:
    model_json = json.load(f)

model = model_from_json(json.dumps(model_json))
model.load_weights('./model.h5')

def preprocess_text(sen):
    # Same cleaning as in classifier.py
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

def to_prediction_data(transcripts):
    data = []

    # A fresh Tokenizer -- not the one fitted on X_train in classifier.py (see question below)
    tokenizer = Tokenizer(num_words=5000)

    maxlen = 200
    for sen in transcripts:
        data.append(preprocess_text(sen))

    data = tokenizer.texts_to_sequences(data)
    data = pad_sequences(data, padding='post', maxlen=maxlen)
    return data

id_to_category = {0: 'INTERNATIONAL', 1: 'BEAUTY', 2: 'ENTERTAINMENT', 3: 'MOTIVATIONAL', 4: 'TECH', 5: 'BUSINESS', 6: 'TRAVEL', 7: 'ENGLISH', 8: 'TED', 9: 'TALK SHOW', 10: 'LIFE', 11: 'ACTIVITY', 12: 'INTERVIEW'}

new_transcripts = ["""hey everybody meet one of my staff members diana chang hey Diana hey you're getting your driver's license I thought because I like to help my staff members I'd take you out give you some of my pointers because I've been driving for a long time I got my license when I was 16 and that was well how old do you think I am like 50 50 something what you are right that's not the point you look you look like you're in you're at most 40s we need to do is look the camera and say it's plausible plausible that I'm in my early 30s it's plausible that Conan is in his early 30s it that's me so that's my break right yeah cool huh I have my own break should use both hands at all times yes see I think that's not necessary my opinion yeah makes it hard to text makes it hard to make see your hands I'm never is it tending to they're supposed to be on actually since airbags came out we learned that you want to drive with the lower no one told me that well you might have learned driving before airbags came out [Music] will you just let that guy pass you and so now it's like you're a snitch so you've got to be like no I'm not your bitch you're my bitch and then you accelerate and cut him off okay go faster pass him I'm trying now yo out the window you're my bitch go yo yeah you're my bitch pretty doing the right thing here you're letting people cross I think these couple right here is milking it you know 80 80 is when people take their time constantly look at your window so the next time you pass someone they can hear you when you say you're my bitch okay go you're my bitch that was sad there was an older Asian couple that you yell did you feel bad doing that okay and just glared in the window Apple okay look now they're coming up they just passed you you just got schooled by a sienna mini man you're doing great Diana but we're gonna up the ante now I want you to learn what it's like to drive when there are other passengers in the car and maybe things get a little distracting so we're gonna swing by over here and pick up two of my friends you cool with that okay we might need you to do a drive-by so you gotta know how to what thug life no seatbelts not to me hey man fuckin hey hey Tupac was one of the biggest thugs I know and he always works yeah was he cool he was coming at door keys like Burkle door key I mean he was just like Kevin Hart I'm one of the coolest guys I'm Diana am i cool yeah you're the fun your friends are gonna not put me in the friend zone when they see me first thing that come to mind is I wanna bump bellies when you scored I haven't been in zone lots of times and let me tell you somethin I get in there I dance we dance you know what I'm talking about no I don't you're talking about get the metaphor you didn't get it yeah Widow we're talking to a nighttime walker my nighttime wanted this pudding bout right here see how much money I'm a prostitute you're gonna turn me out let it out two hours you have my trap look I honestly don't know what you're talking about I honestly I'm not even doing a bit here I don't know what your travel money teaching Diana to drive turn into I'm a male prostitute you're gonna put me out and you're gonna come back in an hour and you want your traffic watching don't let it get over your lane you know I still I used to have a tray full of pennies like in my older what are you that's ridiculous here's a penny there you go is this person cutting you off [Music] okay you just really don't know what to do died in it I got to find somebody to get mad at 
right here slow down okay this penny has your name on it man this is Diana drivin just driving man power windows is a big mistake I can't do this all day if you know spooky business there you go doing that good we looking for Mara one now what do you say Pedro and what 1212 Pedro into what I think he was talking about San Pedro 12 what is it two different places you not even drive why you assuming that Cube knows more about this neighborhood than I do why aren't you asking me about this neighborhood yeah you're absolutely right about that that's racist a little racist towards me active it is it Pedro Pedro I'm not sure I've never been down here and I'm terrified yeah we're gonna bite wait have you done crack code through your butt Kevin how that conversation go yeah man this is taking too long do me a favor blow it in my butt don't follow rules my thing rules are for bitches yeah and you know what bitches get stitches you say this sounds like an announcement bitches get stitches I thought I sounded pretty good bingo Carla that bet you were wild in your college game oh man I was balls to the wall I put it up there and I saw it would stick and I rolled that onion all the way down I tried everything everyone dudes okay I did it all he was a male prostitute paid off my student loan no no we're gonna get you guys a pinata don't settle no ass why you out common is dress like a male closet I got you this pinata colder that's okay oh do you bet the cube I don't look pinatas around here's Bush what's your favorite fast food I almost married you just how much wood could a woodchuck chuck would you take that into the studio that goes the number one not bad it is what the kids are listening to now Naughty by Nature OPP how can I explain it I'll take it frame by frame oh my goodness that's what's happening now what we see how oh good what's that a dispensary oh are you serious yeah that's great we got a license she's normally used to putting drugs in her but it's called the prison wallet yeah you like sour patch kids Conan yeah love so yeah there you go get these I want kilos or sour patch kids okay could you do me a favor I want to fill this with we'd just start stacking them do you have any tape [Music] [Applause] [Music] [Applause] [Music] same amount of smoke this is like a Cheech & Chong movie up through your window we're gonna die we're gonna kill a good time didn't you get a good career you're ready to call suddenly all my anxieties and fears are gone I broke the law in front of this policeman Goa cops hello sorry about this sir sir here's a good thing about me first and foremost I'm a Christian and what I learned is that Jesus once walked on water I don't know what I'm going with this guy thank you very much officer thank you won't happen again here po PO's better get going [Music] we got fake this chicken crack crack cocaine right here I wonder if you could take this up the but the but I would do it that's only come on do it Diana you are ready for your driver's test isn't you ready now guys you are listening the destructor says anything that you don't like you thought battery anything [Applause] you"""]

data = to_prediction_data(new_transcripts)
result = model.predict(data)

for predictions in result:
    for index, prediction in enumerate(predictions):
        print(f'{prediction} -> {id_to_category[index]}')
    print('\n\n')

The prediction output is:

0.0010040111374109983 -> INTERNATIONAL
0.0039509437046945095 -> BEAUTY
0.002164274686947465 -> ENTERTAINMENT
0.000558208383154124 -> MOTIVATIONAL
0.9019364714622498 -> TECH
0.00824095867574215 -> BUSINESS
0.0002612635726109147 -> TRAVEL
0.002875642152503133 -> ENGLISH
0.00045320799108594656 -> TED
0.05539935454726219 -> TALK SHOW
0.020364508032798767 -> LIFE
0.00028050938271917403 -> ACTIVITY
0.0025106456596404314 -> INTERVIEW

I have tried other transcripts as well, and no matter what, all predictions lean toward the TECH category. In the predictor.py script I am not sure how to handle the Tokenizer: in the classifier I call tokenizer.fit_on_texts(X_train) on the training data, but now in predictor.py, since the model has already been created, should I call fit_on_texts again? Honestly, I am not sure the problem is there, because I tried predicting on new transcripts right after creating the model in classifier.py, without making any changes.
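Would persisting the fitted tokenizer be the right approach? A minimal sketch of what I have in mind, assuming the Tokenizer is pickled at training time (the tokenizer.pickle filename is just an example; Keras Tokenizer objects are picklable):

# classifier.py, right after tokenizer.fit_on_texts(X_train):
import pickle

with open('./samples/categorization/tokenizer.pickle', 'wb') as handle:
    # Save the fitted tokenizer so prediction uses the same word_index
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# predictor.py, instead of creating a fresh Tokenizer in to_prediction_data:
with open('./samples/categorization/tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

data = tokenizer.texts_to_sequences([preprocess_text(sen) for sen in new_transcripts])
data = pad_sequences(data, padding='post', maxlen=200)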

Any hints or suggestions on this?

0 Answers:

No answers yet.