Here is my code. I want to write new columns into my original CSV, containing the values of each dictionary created during the run; the last dictionary holds three values, and I want each of those inserted into its own column. Right now the CSV writing happens at the end, but maybe there is a way to write the values each time a new dictionary is generated.
My problem with the CSV part: I can't figure out how to append columns without wiping the original file's contents.
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except ImportError:
    print("Import TreeTagger pas Ok")
from collections import defaultdict
#export le lexique de sentiments
pickle_in = open("dict_pickle", "rb")
dico_lexique = pickle.load(pickle_in)
# extraction colonne verbatim
d_verbatim = {}
with open(sys.argv[1], 'r', encoding='cp1252') as csv_file:
    csv_file.readline()  # skip the header
    for line in csv_file:
        token = line.split(';')
        try:
            d_verbatim[token[0]] = token[1]
        except IndexError:
            print(line)
#print(d_verbatim)
#Using treetagger
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
d_tag = {}
for key, val in d_verbatim.items():
    newvalues = tagger.tag_text(val)
    d_tag[key] = newvalues
#print(d_tag)
#lemmatisation
d_lemma = defaultdict(list)
for k, v in d_tag.items():
    for p in v:
        parts = p.split('\t')
        try:
            if parts[2] == '':
                d_lemma[k].append(parts[0])
            else:
                d_lemma[k].append(parts[2])
        except IndexError:
            print(parts)
#print(d_lemma)
stopWords = set(stopwords.words('french'))
d_filtered_words = {k: [w for w in l if w not in stopWords and w.isalpha()] for k, l in d_lemma.items()}
print(d_filtered_words)
d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        if word in dico_lexique:
            print(word, dico_lexique[word])
Answer:
Your edits seem to have made things worse: you ended up deleting much of the relevant context. I think I've pieced together what you're trying to do. At its core it appears to be a routine that performs sentiment analysis on text.
I would start by creating a class that keeps track of the scores, for example:
class Sentiment:
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )
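For illustration, here is a quick standalone check of how such a class composes (the class is repeated so the snippet runs on its own, and the score values are made up):

```python
# Standalone demo of the Sentiment class (repeated here so this
# snippet is self-contained)
class Sentiment:
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )

# accumulate a couple of (made-up) word scores, as the loop below will do
score = Sentiment()
score += Sentiment(1, 0, 0)   # e.g. a positive word
score += Sentiment(0, 0, 2)   # e.g. a strongly negative word
print(score)  # <Sentiment 1 0 2>
```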
This lets you replace clunky code like `[a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])]` with `score += sentiment` in the function below, and lets us refer to the individual values by name.
I would then suggest preprocessing your pickled data so you don't have to convert things to `int` in the middle of unrelated code, e.g.:
with open("dict_pickle", "rb") as fd:
    dico_lexique = {}
    for word, (pos, neu, neg) in pickle.load(fd).items():
        dico_lexique[word] = Sentiment(int(pos), int(neu), int(neg))
This puts them straight into the class above and seems to match the other constraints in your code, but I don't have your data, so I can't check.
After untangling all the comprehensions and loops, we're left with a nice routine for processing a single piece of text:
def process_text(text):
    """process the specified text

    returns (words, filtered words, total sentiment score)
    """
    words = []
    filtered = []
    score = Sentiment()
    for tag in make_tags(tagger.tag_text(text)):
        word = tag.lemma
        words.append(word)
        if word not in stopWords and word.isalpha():
            filtered.append(word)
            sentiment = dico_lexique.get(word)
            if sentiment is not None:
                score += sentiment
    return words, filtered, score
and we can put this inside a loop that reads rows from the input and writes them to an output file:
filename = sys.argv[1]
tempname = filename + '~'
with open(filename) as fdin, open(tempname, 'w') as fdout:
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')
    # get the header, and blindly append our column names
    header = next(inp)
    out.writerow(header + [
        'd_lemma', 'd_filtered_words',
        'Positive Score', 'Neutral Score', 'Negative Score',
    ])
    for row in inp:
        # assume that the second item contains the text we want to process
        words, filtered, score = process_text(row[1])
        extra_values = [
            words, filtered,
            score.positive, score.neutral, score.negative,
        ]
        # add the values and write out
        assert len(row) == len(header), "code needed to pad the columns out"
        out.writerow(row + extra_values)

# only replace the original if everything succeeds
os.rename(tempname, filename)
We write out to a different file and only rename it on success, which means that if the code crashes it won't leave a partially written file behind. I don't really like working this way, though, and tend to make my scripts read from stdin and write to stdout. That way I can run:
$ python script.py < input.csv > output.csv
when everything works, but it also lets me run:
$ head input.csv | python script.py
if I just want to test on the first few lines of input, or:
$ python script.py < input.csv | less
if I want to inspect the output as it's generated.
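That stdin/stdout variant might look like the sketch below. Note that the process_text() here is a made-up stand-in (it only splits on whitespace and returns zero scores) so the snippet runs without TreeTagger; in real use you would plug in the function defined earlier:

```python
import csv
import sys
from collections import namedtuple

# stand-in score type matching the Sentiment attributes used above
Sentiment = namedtuple('Sentiment', 'positive neutral negative')

def process_text(text):
    # hypothetical stub for the real process_text(): splits on
    # whitespace, does no filtering, and returns zero scores
    words = text.split()
    return words, words, Sentiment(0, 0, 0)

def run(fdin, fdout):
    # same logic as the temp-file loop above, but on arbitrary streams
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')
    header = next(inp)
    out.writerow(header + [
        'd_lemma', 'd_filtered_words',
        'Positive Score', 'Neutral Score', 'Negative Score',
    ])
    for row in inp:
        words, filtered, score = process_text(row[1])
        out.writerow(row + [words, filtered,
                            score.positive, score.neutral, score.negative])

# wiring it to the shell pipeline would then be:
#     run(sys.stdin, sys.stdout)
```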
Note that none of this code has been run, so there may well be bugs in it, but at least I can see what the code is trying to do. Comprehensions and "functional"-style code are great, but they can easily become unreadable if you're not careful.