The program I am building attempts accurate sentiment analysis of tweets about airline mishaps. A tweet is fed in and predicted as neutral, positive, or negative. If the tweet is predicted as negative, a second prediction determines whether the negative cause is attributable to customer service.
I have trained neural networks to do this. What I want to do now is convert the raw data into a format the neural networks can understand. The dictionaries of the two neural networks differ; each is a frequency distribution.
I need to take the raw tweet (in this case, just a string), remove meaningless words, and then run the cleaned tweet through both imported dataframes to output the numbers the neural networks can understand.
The imported dataframes have the format:
word Frequency word_index
0 flight 4742 1
1 wa 1670 2
2 thank 1666 3
3 get 1606 4
4 thi 1369 5
5 http 1200 6
6 hour 1125 7
7 help 1038 8
8 cancel 1034 9
9 servic 985 10
10 delay 980 11
11 time 952 12
12 custom 928 13
13 call 769 14
232 terribl 108 231
387 aw 65 386
468 absolut 52 467
483 hate 49 482
The tweet before processing is:
The customer service was awful. Absolutely terrible. I hated it.
The tweet after processing is:
words
0 custom
1 servic
2 wa
3 aw
4 absolut
5 terribl
6 hate
I need this tweet to be mapped to:
[9, 10, 2, 386, 467, 231, 482]
I have tried the map() function as well as a "search and replace" approach, but I couldn't find a way to make either work.
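For illustration, one way to do this kind of lookup is with a pandas Series indexed by word. The sketch below uses a small hard-coded dataframe standing in for WordDistInit and assumes the word_index column is the target encoding:
import pandas as pd
# Hard-coded stand-in for a few rows of the imported WordDistInit dataframe.
word_dist_demo = pd.DataFrame({
    'word': ['flight', 'wa', 'thank', 'servic', 'custom'],
    'Frequency': [4742, 1670, 1666, 985, 928],
    'word_index': [1, 2, 3, 10, 13],
})
# Build a word -> word_index lookup and run the cleaned tokens through it.
word_to_index = word_dist_demo.set_index('word')['word_index']
cleaned_tokens = ['custom', 'servic', 'wa']
encoded = [int(word_to_index[w]) for w in cleaned_tokens if w in word_to_index.index]
print(encoded)  # [13, 10, 2]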
import numpy as np
import pandas as pd
import nltk
import tensorflow as tf
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow import keras
nltk.download('stopwords')
nltk.download('punkt')
WordDistInit = pd.read_csv('WordDistDenseInit.csv')
WordDistSeco = pd.read_csv('WordDistDenseSeco.csv')
WordDistInit.columns=['word', 'Frequency', 'word_index']
WordDistSeco.columns=['word', 'Frequency', 'word_index']
print("Data loaded")
#DOWNLOAD TWEET HERE
tweet=["The customer service was awful. Absolutely terrible. I hated it."]
#Finds word stems -- Running, runner, run - > Run.
stemming = PorterStemmer()
#Removes "stopwords", words that generally don't add anything to a sentence. e.g. "The".
stops = set(stopwords.words("english"))
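As a quick sanity check of the two objects defined above (assuming the NLTK downloads succeeded), stemming and stopword lookups behave like this:
print(stemming.stem("running"))  # -> 'run'
print("the" in stops)            # -> True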
#Defines enclosing cleaning function. May optimize in future.
def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X
#Defines nested cleaning function. Actually does the "work".
def clean_text(raw_text):
    #Converts to lowercase.
    text = raw_text.lower()
    #Converts text into a list of words.
    tokens = nltk.word_tokenize(text)
    #Removes punctuation and numbers.
    token_words = [w for w in tokens if w.isalpha()]
    #Stems words.
    stemmed_words = [stemming.stem(w) for w in token_words]
    #Removes stopwords.
    meaningful_words = [w for w in stemmed_words if not w in stops]
    #Returns cleaned data.
    return meaningful_words
print("Cleanup function defined without error.")
#Define text-to-clean
text_to_clean = tweet
#Cleans text
tweet_cleaned = apply_cleaning_function_to_list(text_to_clean)
#Flattens list
flat_tweet = [item for sublist in tweet_cleaned for item in sublist]
#create new df
df_init, df_seco = pd.DataFrame({'words':flat_tweet}), pd.DataFrame({'words':flat_tweet})
print("Text-clean function called without error.")
The converted tweet is needed in the next code block, where it will be analysed by the first neural network.
I can't work it out. I have found other examples online that deal with numbers only, but I haven't been able to adapt them to a mix of strings and numbers.
Thanks in advance.
Edit:
I have converted the tweet into a string:
Y = ', '.join(flat_tweet)
print(Y)
custom, servic, wa, aw, absolut, terribl, hate
and I have turned WordDistInit into a dictionary, replacing the integer keys with strings:
#Function to convert text to numbers.
#Cutoff_for_rare_words removes words only used once.
def training_text_to_numbers(text, cutoff_for_rare_words = 1):
    #Convert Pandas format to dictionary
    word_dict = WordDistInit.word.to_dict()
    #Converts int keys into strs
    word_dict = dict((str(k),v) for k,v in word_dict.items())
This means I have a dictionary in the following format:
{'0': 'flight', '1': 'wa', '2': 'thank', '3': 'get', '4': 'thi', '5': 'http', '6': 'hour', '7': 'help', '8': 'cancel', '9': 'servic', '10': 'delay', '11': 'time', '12': 'custom', '13': 'call', '14': 'bag', '15': 'wait', '16': 'plane', '17': 'need', '18': 'fli', '19': 'hold', '20': 'amp', '21': 'us', '22': 'go', '23': 'would', '24': 'whi', '25': 'tri', '26': 'one', '27': 'still', '28': 'pleas', '29': 'airlin', '30': 'day', '31': 'ca', '32'...}
If I can map every word in the Y variable to the corresponding (string) integer, that should work.
I have tried this, but I can't find a solution.
I'm not worried about efficiency at this point; I just want to prove the concept. I'm completely stuck and don't know how to proceed.
Thanks again.
Answer 0 (score: 0)
Below is example code that generally solves my problem. The only remaining issue is that an error is raised if a value not in the dictionary is entered, but that can be worked around.
lookup = {'0': 'flight', '1': 'wa', '2': 'thank', '3': 'get', '4': 'thi', '5': 'http', '6': 'hour', '7': 'help', '8': 'cancel', '9': 'servic', '10': 'delay', '11': 'time', '12': 'custom', '13': 'call', '14': 'bag', '15': 'wait', '16': 'plane', '17': 'need', '18': 'fli', '19': 'hold', '20': 'amp', '21': 'us', '22': 'go', '23': 'would', '24': 'whi', '25': 'tri', '26': 'one', '27': 'still', '28': 'pleas', '29': 'airlin', '30': 'day', '31': 'ca'}
flipped_lookup = {v:k for k,v in lookup.items()}
string_to_analyze = "custom, servic, wa, aw, absolut, terribl, hate"
list_to_analyze = [w.strip() for w in string_to_analyze.split(',')]
analyze_value_list = [int(flipped_lookup[w]) for w in list_to_analyze]
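To avoid the KeyError when a word is not in the dictionary, one option is to fall back to a default with dict.get; the -1 below is an arbitrary out-of-vocabulary marker, not part of the original data:
# Unknown words map to -1 instead of raising a KeyError.
analyze_value_list = [int(flipped_lookup.get(w, -1)) for w in list_to_analyze]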