How to map a dataframe column to an input dataframe column for transformation

Asked: 2019-06-12 03:25:08

Tags: python pandas transformation

The program I'm building attempts fine-grained sentiment analysis of tweets about airline mishaps. Given a tweet, it should predict whether the tweet is neutral, positive, or negative. If the tweet is predicted as negative, it then predicts whether the negative sentiment is attributable to customer service.

I have trained neural networks to do this. What I want to do now is transform the raw data into a format the neural networks can understand. The two networks' dictionaries differ in the form of their frequency distributions.

I need to take the raw tweet (in this case just a string), remove meaningless words, and then feed the cleaned tweet through the two imported dataframes to output numbers the neural networks can understand.

The imported dataframes have the format:

                word        Frequency     word_index
0               flight      4742          1
1                   wa      1670          2
2                thank      1666          3
3                  get      1606          4
4                  thi      1369          5
5                 http      1200          6
6                 hour      1125          7
7                 help      1038          8
8               cancel      1034          9
9               servic       985         10
10               delay       980         11
11                time       952         12
12              custom       928         13
13                call       769         14
232            terribl       108         231
387                 aw        65         386
468            absolut        52         467
483               hate        49         482

The tweet pre-processing is:

The customer service was awful. Absolutely terrible. I hated it.

The tweet post-processing is:

    words
0   custom
1   servic
2       wa
3       aw
4  absolut
5  terribl
6     hate

I need that tweet to map to:

[9, 10, 2, 386, 467, 231, 482]

Besides a "search and replace" function, I have also tried the map() function, but I couldn't work it out either way.
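For reference, pandas can do exactly this lookup with Series.map against a word-to-index mapping built from the frequency dataframe. A minimal sketch with toy data mirroring the columns above (the output follows the word_index column of the table, so custom maps to 13 here):

```python
import pandas as pd

# Toy stand-in for the imported frequency dataframe (same columns as above).
word_dist = pd.DataFrame({
    'word': ['flight', 'wa', 'servic', 'delay', 'custom',
             'terribl', 'aw', 'absolut', 'hate'],
    'word_index': [1, 2, 10, 11, 13, 231, 386, 467, 482],
})

# Build a word -> word_index Series, then map the cleaned tweet through it.
lookup = word_dist.set_index('word')['word_index']
tweet_words = ['custom', 'servic', 'wa', 'aw', 'absolut', 'terribl', 'hate']
mapped = pd.Series(tweet_words).map(lookup).tolist()
print(mapped)  # [13, 10, 2, 386, 467, 231, 482]
```

Words missing from the dataframe would come back as NaN rather than raising an error, which can then be filtered or replaced.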

import numpy as np
import pandas as pd
import nltk
import tensorflow as tf

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow import keras

nltk.download('stopwords')
nltk.download('punkt')

WordDistInit = pd.read_csv('WordDistDenseInit.csv')
WordDistSeco = pd.read_csv('WordDistDenseSeco.csv')
WordDistInit.columns=['word', 'Frequency', 'word_index']
WordDistSeco.columns=['word', 'Frequency', 'word_index']
print("Data loaded")
#DOWNLOAD TWEET HERE
tweet=["The customer service was awful. Absolutely terrible. I hated it."]

#Finds word stems -- Running, runner, run - > Run.
stemming = PorterStemmer()
#Removes "stopwords", words that generally don't add anything to a sentence. e.g. "The".
stops = set(stopwords.words("english"))

#Defines enclosing cleaning function. May optimize in future.
def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X

#Defines nested cleaning function. Actually does the "work".
def clean_text(raw_text):
    #Removes uppercase.
    text = raw_text.lower()
    #Converts text into list of words.
    tokens = nltk.word_tokenize(text)
    #Removes punctuation and numbers.
    token_words = [w for w in tokens if w.isalpha()]
    #Stems words.
    stemmed_words = [stemming.stem(w) for w in token_words]
    #Removes stopwords.
    meaningful_words = [w for w in stemmed_words if not w in stops]
    #Returns cleaned data.
    return meaningful_words

print("Cleanup function defined without error.")

#Define text-to-clean
text_to_clean = tweet
#Cleans text
tweet_cleaned = apply_cleaning_function_to_list(text_to_clean)
#Flattens list
flat_tweet = [item for sublist in tweet_cleaned for item in sublist]

#create new df 
df_init, df_seco = pd.DataFrame({'words':flat_tweet}), pd.DataFrame({'words':flat_tweet})

print("Text-clean function called without error.")
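One way to continue from df_init above is a left merge against the imported frequency dataframe, which preserves the tweet's word order and leaves NaN for unknown words. A sketch with toy stand-ins for the two dataframes (column names assumed to match the question):

```python
import pandas as pd

# Toy stand-ins for df_init and the imported WordDistInit dataframe.
df_init = pd.DataFrame({'words': ['custom', 'servic', 'wa', 'hate']})
word_dist_init = pd.DataFrame({
    'word': ['wa', 'servic', 'custom', 'hate'],
    'Frequency': [1670, 985, 928, 49],
    'word_index': [2, 10, 13, 482],
})

# A left merge keeps the left frame's row order; unmatched words become NaN.
merged = df_init.merge(word_dist_init, left_on='words',
                       right_on='word', how='left')
indices = merged['word_index'].tolist()
print(indices)  # [13, 10, 2, 482]
```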

The transformed tweet is needed in the next code block, to be analyzed by the first neural network.

I can't figure it out. I found other examples online that work with numbers only, but I couldn't adapt them to a combination of strings and numbers.

Thanks.

EDIT:

I have converted the tweet into a string:

Y = ', '.join(flat_tweet)
print(Y)
custom, servic, wa, aw, absolut, terribl, hate

And I have turned WordDistInit into a dictionary, removing the integer keys and replacing them with strings:

#Function to convert text to numbers.
#Cutoff_for_rare_words removes words only used once.
def training_text_to_numbers(text, cutoff_for_rare_words = 1):
    #Convert Pandas format to dictionary
    word_dict = WordDistInit.word.to_dict()
    #Converts ints into strs
    word_dict = dict((str(k),v) for k,v in word_dict.items())

This means I have a dictionary in the following format:

{'0': 'flight', '1': 'wa', '2': 'thank', '3': 'get', '4': 'thi', '5': 'http', '6': 'hour', '7': 'help', '8': 'cancel', '9': 'servic', '10': 'delay', '11': 'time', '12': 'custom', '13': 'call', '14': 'bag', '15': 'wait', '16': 'plane', '17': 'need', '18': 'fli', '19': 'hold', '20': 'amp', '21': 'us', '22': 'go', '23': 'would', '24': 'whi', '25': 'tri', '26': 'one', '27': 'still', '28': 'pleas', '29': 'airlin', '30': 'day', '31': 'ca', '32'...}

If I could map all of the Y-variable words to their corresponding (string) integers, that should work.

I have tried this, but I couldn't work out a solution.

I'm not worried about efficiency right now; I just want to prove the concept. I'm completely stuck and don't know how to proceed.

Thanks again.

1 Answer:

Answer 0 (score: 0)

Here is sample code that generally solves my problem. The only remaining issue is that an error is raised if a value that isn't in the dictionary is entered, but that can be fixed.

lookup = {'0': 'flight', '1': 'wa', '2': 'thank', '3': 'get', '4': 'thi', '5': 'http', '6': 'hour', '7': 'help', '8': 'cancel', '9': 'servic', '10': 'delay', '11': 'time', '12': 'custom', '13': 'call', '14': 'bag', '15': 'wait', '16': 'plane', '17': 'need', '18': 'fli', '19': 'hold', '20': 'amp', '21': 'us', '22': 'go', '23': 'would', '24': 'whi', '25': 'tri', '26': 'one', '27': 'still', '28': 'pleas', '29': 'airlin', '30': 'day', '31': 'ca'}

flipped_lookup = {v:k for k,v in lookup.items()}
string_to_analyze = "custom, servic, wa, aw, absolut, terribl, hate"
list_to_analyze = [w.strip() for w in string_to_analyze.split(',')]
analyze_value_list = [int(flipped_lookup[w]) for w in list_to_analyze]
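The KeyError mentioned above can be avoided by looking words up with dict.get and a sentinel value for out-of-vocabulary words. A sketch reusing the flipped-lookup idea, where OOV_INDEX is a hypothetical sentinel (not part of the original code):

```python
# Toy flipped lookup: word -> index-as-string, as built in the answer above.
flipped_lookup = {'custom': '12', 'servic': '9', 'wa': '1'}

OOV_INDEX = 0  # hypothetical sentinel for out-of-vocabulary words

words = ['custom', 'servic', 'wa', 'aw']  # 'aw' is missing from the lookup
indices = [int(flipped_lookup.get(w, OOV_INDEX)) for w in words]
print(indices)  # [12, 9, 1, 0]
```

Reserving an index for unknown words also keeps the input length consistent, which matters if the neural network expects a fixed-size sequence.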