将数据框列转换为列表的问题

时间:2021-02-06 00:26:23

标签: pandas dataframe csv nested-lists

我有一个具有以下结构的 csv 文件:

Tokens,Tags,Polarities
"['i', 'agree', 'about', 'arafat', '.', 'i', 'mean', ',', 'shit', ',', 'they', 'even', 'gave', 'one', 'to', 'jimmy', 'carter', 'ha', '.', 'it', 'should', 'be', 'called', ""''"", 'the', 'worst', 'president', ""''"", 'prize', '.']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.']","[0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, 2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']","[0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['my', '3-year-old', 'was', 'amazed', 'yesterday', 'to', 'find', 'that', ""'"", 'real', ""'"", '10', 'pin', 'bowling', 'is', 'nothing', 'like', 'it', 'is', 'on', 'the', 'wii', '...']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1]"
"['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]","[-1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1]"

我正在尝试按如下方式读取文件:

twitter_train = pd.read_csv('twitter_train.csv')

然后我可以看到它具有正确的结构:

twitter_train.head(3)

Tokens  Tags    Polarities
0   ['i', 'agree', 'about', 'arafat', '.', 'i', 'm...   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
1   ['musicmonday', 'britney', 'spears', '-', 'luc...   [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   [-1, 2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1,...
2   ['wtf', '?', 'hilary', 'swank', 'is', 'coming'...   [0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   [-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1,...

我想把每一列转换成一个列表,例如:

twitter_train_lists = twitter_train['Tokens'].tolist()

但是我的结构不正确,列表中的每个元素和每个列表本身都有一个额外的 \"

['[\'i\', \'agree\', \'about\', \'arafat\', \'.\', \'i\', \'mean\', \',\', \'shit\', \',\', \'they\', \'even\', \'gave\', \'one\', \'to\', \'jimmy\', \'carter\', \'ha\', \'.\', \'it\', \'should\', \'be\', \'called\', "\'\'", \'the\', \'worst\', \'president\', "\'\'", \'prize\', \'.\']',
 "['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.']",
 "['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']",
 '[\'my\', \'3-year-old\', \'was\', \'amazed\', \'yesterday\', \'to\', \'find\', \'that\', "\'", \'real\', "\'", \'10\', \'pin\', \'bowling\', \'is\', \'nothing\', \'like\', \'it\', \'is\', \'on\', \'the\', \'wii\', \'...\']',
 "['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']"]

我如何从这个 csv 文件中正确提取列表以获得正确的结构:

[['i', 'agree', 'about', 'arafat', '.', 'i', 'mean', ',', 'shit', ',', 'they', 'even', 'gave', 'one', 'to', 'jimmy', 'carter', 'ha', '.', 'it', 'should', 'be', 'called', "''", 'the', 'worst', 'president', "''", 'prize', '.'],
['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.'],
['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow'],
['my', '3-year-old', 'was', 'amazed', 'yesterday', 'to', 'find', 'that', "'", 'real', "'", '10', 'pin', 'bowling', 'is', 'nothing', 'like', 'it', 'is', 'on', 'the', 'wii', '...'],
['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']]

您可以在此处找到原始数据集文件:https://github.com/1tangerine1day/Aspect-Term-Extraction-and-Analysis/tree/master/data

更新: 我尝试了另一种方法,但遇到了同样的问题:

import csv

with open('twitter_train.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

另一个错误的输出:

print(data[3])

["['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']", '[0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]', '[-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]']

提前致谢!

1 个答案:

答案 0 :(得分:0)

您在 csv 中的信息实际上是一个字符串而不是列表。你需要让他们成为真正的清单。

twitter_train = pd.read_csv('twitter_train.csv')
twitter_train['Tokens'] = list(twitter_train['Tokens'].str.strip("['").str.rstrip("']").str.split("', '"))