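I have a large tab-delimited csv file: the first column holds an emotion word, the second column holds one of the eight basic emotions plus the values positive and negative, and the last column is a boolean flag that says whether the second column's value applies to the word in the first.
File excerpt: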
snarl anger 1
snarl anticipation 0
snarl disgust 1
snarl fear 0
snarl joy 0
snarl negative 1
snarl positive 0
snarl sadness 0
snarl surprise 0
snarl trust 0
snarling anger 1
snarling anticipation 0
snarling disgust 0
snarling fear 0
snarling joy 0
snarling negative 1
snarling positive 0
snarling sadness 0
snarling surprise 0
snarling trust 0
My code so far:
import csv
from pprint import pprint
from itertools import groupby
l = list(csv.reader(open('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt')))
f = lambda x: x[-1] #manipulate number to see different results
{k:[tuple(x[0:1]) for x in v] for k,v in groupby(sorted(l[1:], key=f), f)}
pprint(l)
My current output doesn't look right:
['asylum\tanger\t0'],
['asylum\tanticipation\t0'],
['asylum\tdisgust\t0'],
['asylum\tfear\t1'],
['asylum\tjoy\t0'],
['asylum\tnegative\t1'],
['asylum\tpositive\t0'],
['asylum\tsadness\t0'],
['asylum\tsurprise\t0'],
['asylum\ttrust\t0'],
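My question is: how do I build a dictionary of lists with one unique key per emotion word (collapsing the 10 repeated rows for each word into 1), where each list contains only the second-column values whose boolean flag is 1?
Any help would be much appreciated!
Edit: as requested in one of the answers, an example of the desired output looks like this:
{'snarl': ['anger', 'disgust'], #included in list due to having '1', ignoring 'positive' and 'negative'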
'snarling': ['anger'], #etc...
}
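Edit 2:
The first and last lines of the file are empty, as I mentioned in my comments on each answer.
Answer 0 (score: 2)
Here is one approach, using defaultdict.
For example: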
import csv
from collections import defaultdict
d = defaultdict(list)
with open(filename) as infile:
    reader = csv.reader(infile, delimiter="\t")
    for row in reader:
        if row[2] == '1':
            d[row[0]].append(row[1])
print(d)
Edit based on the comments (skips blank lines):
from collections import defaultdict
d = defaultdict(list)
with open(filename) as infile:
    for row in infile:
        if row.strip():
            val = row.split()
            if val[2] == '1':
                d[val[0]].append(val[1])
print(d)
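Answer 1 (score: 1)
You can use collections.defaultdict and update a dictionary of lists while iterating over a csv.reader object.
Your condition goes into the if statement; take care to convert the number to an integer with int.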
import csv
from collections import defaultdict
from io import StringIO
x = StringIO("""snarl anger 1
snarl anticipation 0
...
snarling surprise 0
snarling trust 0""")
d = defaultdict(list)
# replace x with open('file.csv', 'r')
with x as fin:
    # filter(None, ...) drops the empty rows produced by blank lines
    reader = filter(None, csv.reader(fin, delimiter=' ', skipinitialspace=True))
    # or, reader = filter(None, csv.reader(fin, delimiter='\t'))
    for word, emotion, num in reader:
        if int(num):
            d[word].append(emotion)
Result:
print(d)
defaultdict(list,
{'snarl': ['anger', 'disgust', 'negative'],
'snarling': ['anger', 'negative']})
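The desired output in the question leaves out 'positive' and 'negative', which the result above still includes. Here is a minimal sketch of one way to drop them, assuming the three-column tab-separated layout from the question. The names POLARITY and build_lexicon are made up for illustration and are not part of the original answer.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical names: POLARITY and build_lexicon are illustrative only.
POLARITY = {"positive", "negative"}

def build_lexicon(lines, delimiter="\t"):
    """Map each word to the emotions flagged '1', skipping polarity labels."""
    d = defaultdict(list)
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != 3:  # skips blank or malformed lines
            continue
        word, emotion, flag = row
        if flag == "1" and emotion not in POLARITY:
            d[word].append(emotion)
    return dict(d)

# Small in-memory sample with blank first and last lines, as in the question.
sample = StringIO(
    "\n"
    "snarl\tanger\t1\n"
    "snarl\tdisgust\t1\n"
    "snarl\tnegative\t1\n"
    "snarl\tjoy\t0\n"
    "snarling\tanger\t1\n"
    "snarling\tnegative\t1\n"
    "\n"
)
print(build_lexicon(sample))
```

For a real file, passing an open file object (opened with newline='') in place of the StringIO sample should work the same way.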
Answer 2 (score: 0)
I think you are close to the answer. However, when you call csv.reader you do not specify a delimiter, which means it defaults to a comma.
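To illustrate the delimiter point, here is a small sketch that compares the default behaviour with delimiter='\t' on two in-memory lines. The sample data is made up to mirror the file format.

```python
import csv
from io import StringIO

data = "snarl\tanger\t1\nsnarl\tjoy\t0\n"

# Default delimiter is a comma, so each tab-separated line stays a single field.
default_rows = list(csv.reader(StringIO(data)))

# With delimiter='\t' each line is split into its three fields.
tab_rows = list(csv.reader(StringIO(data), delimiter="\t"))

print(default_rows)  # [['snarl\tanger\t1'], ['snarl\tjoy\t0']]
print(tab_rows)      # [['snarl', 'anger', '1'], ['snarl', 'joy', '0']]
```

The first result has the same shape as the output shown in the question, where every row is a one-element list with embedded \t characters.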
>>> from itertools import groupby
>>> l = map(str.split, open('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt').readlines())
>>> f = lambda x: x[1]
>>> {k:set(e[0] for e in v) for k,v in groupby(sorted(filter(bool, l), key=f), f)}
{'anger': {'snarling', 'snarl'}, 'anticipation': {'snarling', 'snarl'}, 'disgust': {'snarling', 'snarl'}, 'fear': {'snarling', 'snarl'}, 'joy': {'snarling', 'snarl'}, 'negative': {'snarling', 'snarl'}, 'positive': {'snarling', 'snarl'}, 'sadness': {'snarling', 'snarl'}, 'surprise': {'snarling', 'snarl'}, 'trust': {'snarling', 'snarl'}}
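The grouping above is keyed on the emotion column and keeps rows whose flag is 0, so it does not yet match the requested output. Here is a sketch of the same groupby idea keyed on the word instead and restricted to rows flagged '1', assuming whitespace-separated columns. The function name group_by_word is hypothetical.

```python
from itertools import groupby

def group_by_word(lines):
    """Group emotions flagged '1' under their word, using sorted + groupby."""
    # Split non-blank lines into [word, emotion, flag] rows.
    rows = [line.split() for line in lines if line.strip()]
    # Keep only well-formed rows whose flag is '1'.
    rows = [r for r in rows if len(r) == 3 and r[2] == "1"]
    key = lambda r: r[0]
    # sorted() is stable, so emotions keep their file order within each word.
    return {word: [r[1] for r in grp]
            for word, grp in groupby(sorted(rows, key=key), key)}

sample = [
    "\n",
    "snarl\tanger\t1\n",
    "snarl\tdisgust\t1\n",
    "snarl\tjoy\t0\n",
    "snarling\tanger\t1\n",
    "snarling\ttrust\t0\n",
    "\n",
]
print(group_by_word(sample))
```

Words with no '1' flags at all do not appear as keys in this sketch. Whether that is desirable depends on the use case.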
Answer 3 (score: 0)
This is how I would do it. If you prefer, you could also use collections.defaultdict (instead of setdefault):
import csv
with open('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', newline='') as file:
    l = [row[:-1] for row in csv.reader(file, delimiter='\t')
         if row and row[-1] == '1']  # Not empty and last elem is true.
d = {}
for e_word, basic in l:
    d.setdefault(e_word, []).append(basic)
print('dictionary d:\n', d)
Output:
dictionary d:
{'snarl': ['anger', 'disgust', 'negative'], 'snarling': ['anger', 'negative']}