Question

我有一个具有以下结构的 csv 文件：

Tokens,Tags,Polarities
"['i', 'agree', 'about', 'arafat', '.', 'i', 'mean', ',', 'shit', ',', 'they', 'even', 'gave', 'one', 'to', 'jimmy', 'carter', 'ha', '.', 'it', 'should', 'be', 'called', ""''"", 'the', 'worst', 'president', ""''"", 'prize', '.']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.']","[0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, 2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']","[0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['my', '3-year-old', 'was', 'amazed', 'yesterday', 'to', 'find', 'that', ""'"", 'real', ""'"", '10', 'pin', 'bowling', 'is', 'nothing', 'like', 'it', 'is', 'on', 'the', 'wii', '...']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1]"
"['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]","[-1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1]"

我正在尝试按如下方式读取文件：

twitter_train = pd.read_csv('twitter_train.csv')

然后我可以看到它具有正确的结构：

twitter_train.head(3)

Tokens  Tags    Polarities
0   ['i', 'agree', 'about', 'arafat', '.', 'i', 'm...   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
1   ['musicmonday', 'britney', 'spears', '-', 'luc...   [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   [-1, 2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1,...
2   ['wtf', '?', 'hilary', 'swank', 'is', 'coming'...   [0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   [-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1,...

我想把每一列转换成一个列表，例如：

twitter_train_lists = twitter_train['Tokens'].tolist()

但是我的结构不正确，列表中的每个元素和每个列表本身都有一个额外的 \ 或 "：

['[\'i\', \'agree\', \'about\', \'arafat\', \'.\', \'i\', \'mean\', \',\', \'shit\', \',\', \'they\', \'even\', \'gave\', \'one\', \'to\', \'jimmy\', \'carter\', \'ha\', \'.\', \'it\', \'should\', \'be\', \'called\', "\'\'", \'the\', \'worst\', \'president\', "\'\'", \'prize\', \'.\']',
 "['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.']",
 "['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']",
 '[\'my\', \'3-year-old\', \'was\', \'amazed\', \'yesterday\', \'to\', \'find\', \'that\', "\'", \'real\', "\'", \'10\', \'pin\', \'bowling\', \'is\', \'nothing\', \'like\', \'it\', \'is\', \'on\', \'the\', \'wii\', \'...\']',
 "['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']"]

我如何从这个 csv 文件中正确提取列表以获得正确的结构：

[['i', 'agree', 'about', 'arafat', '.', 'i', 'mean', ',', 'shit', ',', 'they', 'even', 'gave', 'one', 'to', 'jimmy', 'carter', 'ha', '.', 'it', 'should', 'be', 'called', "''", 'the', 'worst', 'president', "''", 'prize', '.'],
['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.'],
['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow'],
['my', '3-year-old', 'was', 'amazed', 'yesterday', 'to', 'find', 'that', "'", 'real', "'", '10', 'pin', 'bowling', 'is', 'nothing', 'like', 'it', 'is', 'on', 'the', 'wii', '...'],
['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']]

您可以在此处找到原始数据集文件：https://github.com/1tangerine1day/Aspect-Term-Extraction-and-Analysis/tree/master/data

更新：我尝试了另一种方法，但遇到了同样的问题：

import csv

with open('twitter_train.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

另一个错误的输出：

print(data[3])

["['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']", '[0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]', '[-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]']

提前致谢！

Answer 1

您在 csv 中的信息实际上是一个字符串而不是列表。你需要让他们成为真正的清单。

twitter_train = pd.read_csv('twitter_train.csv')
twitter_train['Tokens'] = list(twitter_train['Tokens'].str.strip("['").str.rstrip("']").str.split("', '"))

将数据框列转换为列表的问题

1 个答案: