Adding new columns to a CSV file with values from different dictionary comprehensions

Date: 2019-05-01 12:28:36

Tags: python csv list-comprehension writer dictionary-comprehension

Here is my code below. I want to write new columns into my original CSV; each column should hold the values of one of the dictionaries created along the way, and for the last dictionary, which holds three values per key, I'd like each value inserted into its own column. The CSV-writing code comes at the end, but perhaps there is a way to write the values each time I build a new dictionary.

My CSV code so far: I can't figure out how to add the new columns without wiping out the original file's contents.


# -*- coding: UTF-8 -*-
import codecs 
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd

try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except ImportError:
    print("Import TreeTagger pas Ok")

from collections import defaultdict

# load the pickled sentiment lexicon
pickle_in = open("dict_pickle", "rb")
dico_lexique = pickle.load(pickle_in)


# extraction colonne verbatim
d_verbatim = {}

with open(sys.argv[1], 'r', encoding='cp1252') as csv_file:
    csv_file.readline()
    for line in csv_file:
        token = line.split(';')
        try:
            d_verbatim[token[0]] = token[1]
        except IndexError:
            print(line)

#print(d_verbatim)

#Using treetagger   
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
d_tag = {}
for key, val in d_verbatim.items(): 
        newvalues = tagger.tag_text(val)
        d_tag[key] = newvalues
#print(d_tag)


#lemmatisation  
d_lemma = defaultdict(list)
for k, v in d_tag.items():
    for p in v:
        parts = p.split('\t')
        try:
            if parts[2] == '':
                d_lemma[k].append(parts[0])
            else:
                d_lemma[k].append(parts[2]) 
        except IndexError:
            print(parts)
#print(d_lemma) 


stopWords = set(stopwords.words('french'))          
d_filtered_words = {k: [w for w in l if w not in stopWords and w.isalpha()] for k, l in d_lemma.items()}

print(d_filtered_words)     

d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        if word in dico_lexique:
            print(word, dico_lexique[word])

1 Answer:

Answer 0 (score: 0):

Your edits seem to have made things worse: you ended up deleting a lot of the relevant context. I think I've pieced together what you're trying to do, though. At its core this looks like a routine that performs sentiment analysis on text.

I'd start by creating a class that keeps track of these scores, e.g.:

class Sentiment:
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )
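As a quick sanity check with made-up counts (the class is repeated here so the snippet runs on its own), two instances add field by field:

```python
class Sentiment:
    # same class as above, repeated so this snippet is self-contained
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )

a = Sentiment(positive=2, neutral=1)
b = Sentiment(negative=3)
total = a + b
print(total)                       # <Sentiment 2 1 3>
print(sum([a, b], Sentiment()))    # __add__ also makes sum() work
```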

This lets you replace awkward code like `[a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])]` with `score += sentiment` in the function below, and lets us refer to the individual values by name.

Then I'd suggest preprocessing your pickled data so you don't have to convert values to int in the middle of otherwise unrelated code, e.g.:

with open("dict_pickle", "rb") as fd:
    dico_lexique = {}
    for word, (pos, neu, neg) in pickle.load(fd).items():
        dico_lexique[word] = Sentiment(int(pos), int(neu), int(neg))

This puts the values straight into the class above and seems to match the other constraints in your code. But I don't have your data, so I can't check.
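For illustration, here is that preprocessing run on a tiny made-up lexicon (plain int tuples stand in for the Sentiment class to keep the snippet self-contained); the real data of course comes from the pickle:

```python
# made-up entries standing in for pickle.load(fd): word -> three numeric strings
raw = {'bon': ('1', '0', '0'), 'mauvais': ('0', '0', '1')}

# convert everything to ints once, up front
dico_lexique = {word: tuple(int(x) for x in scores) for word, scores in raw.items()}
print(dico_lexique)  # {'bon': (1, 0, 0), 'mauvais': (0, 0, 1)}
```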

After untangling all the comprehensions and loops, we're left with a nice routine for processing a single piece of text:

def process_text(text):
    """process the specified text
    returns (words, filtered words, total sentiment score)
    """
    words = []
    filtered = []
    score = Sentiment()

    for tag in make_tags(tagger.tag_text(text)):
        word = tag.lemma
        words.append(word)

        if word not in stopWords and word.isalpha():
            filtered.append(word)

        sentiment = dico_lexique.get(word)
        if sentiment is not None:
            score += sentiment

    return words, filtered, score
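TreeTagger needs an external installation, so here is a runnable sketch of the same loop with a plain whitespace split standing in for `tagger.tag_text`, and made-up stop words and lexicon entries:

```python
stopWords = {'le', 'la', 'un'}                            # made-up stop words
dico_lexique = {'bon': (1, 0, 0), 'mauvais': (0, 0, 1)}   # made-up lexicon

def process_text(text):
    """Same shape as the TreeTagger version: one pass collects the words,
    the filtered words, and the running sentiment totals."""
    words, filtered = [], []
    score = [0, 0, 0]
    for word in text.lower().split():   # stand-in for make_tags(tagger.tag_text(text))
        words.append(word)
        if word not in stopWords and word.isalpha():
            filtered.append(word)
        sentiment = dico_lexique.get(word)
        if sentiment is not None:
            score = [a + b for a, b in zip(score, sentiment)]
    return words, filtered, score

words, filtered, score = process_text('le bon le mauvais bon')
print(words)     # ['le', 'bon', 'le', 'mauvais', 'bon']
print(filtered)  # ['bon', 'mauvais', 'bon']
print(score)     # [2, 0, 1]
```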

and we can put this into a loop that reads rows from the input and sends them to the output file:

filename = sys.argv[1]
tempname = filename + '~'

with open(filename) as fdin, open(tempname, 'w') as fdout:
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')

    # get the header, and blindly append our column names
    header = next(inp)
    out.writerow(header + [
        'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
    ])

    for row in inp:
        # assume that second item contains the text we want to process
        words, filtered, score = process_text(row[1])
        extra_values = [
            words, filtered,
            score.positive, score.neutral, score.negative,
        ]
        # add the values and write out
        assert len(row) == len(header), "code needed to pad the columns out"
        out.writerow(row + extra_values)

# only replace if everything succeeds
os.rename(tempname, filename)
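Here is the write-then-rename pattern demonstrated end to end on a throwaway two-column CSV (made-up data and a hypothetical `word_count` column, so it stays self-contained):

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
filename = os.path.join(workdir, 'input.csv')

# create a made-up input file
with open(filename, 'w', newline='', encoding='utf-8') as fd:
    csv.writer(fd, delimiter=';').writerows(
        [['id', 'verbatim'], ['1', 'tres bon produit']])

tempname = filename + '~'
with open(filename, newline='', encoding='utf-8') as fdin, \
        open(tempname, 'w', newline='', encoding='utf-8') as fdout:
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')
    out.writerow(next(inp) + ['word_count'])       # extend the header
    for row in inp:
        out.writerow(row + [len(row[1].split())])  # append a new column per row

os.rename(tempname, filename)  # replace the original only on success

with open(filename, newline='', encoding='utf-8') as fd:
    rows = list(csv.reader(fd, delimiter=';'))
print(rows)  # [['id', 'verbatim', 'word_count'], ['1', 'tres bon produit', '3']]
```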

We write out to a different file and only rename it on success, which means a crash won't leave a partially written file behind. That said, I don't like working with files like this, and would tend to make my scripts read from stdin and write to stdout. That way I can run:

$ python script.py < input.csv > output.csv

when everything works, but it also lets me run:

$ head input.csv | python script.py

if I just want to test on the first few lines of input, or:

$ python script.py < input.csv | less

if I want to inspect the output as it's being generated.
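One way to structure the script for that is to factor the transformation into a function taking the input and output streams as arguments, so the same code can be driven by `sys.stdin`/`sys.stdout` or by in-memory streams (the `word_count` column here is a hypothetical stand-in for the sentiment columns):

```python
import csv
import io

def run(fdin, fdout):
    """Read ;-separated rows from fdin, append a column, write to fdout."""
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')
    out.writerow(next(inp) + ['word_count'])       # hypothetical extra column
    for row in inp:
        out.writerow(row + [len(row[1].split())])

# in the real script you'd call run(sys.stdin, sys.stdout) and invoke it as
#   python script.py < input.csv > output.csv
# here we exercise it with in-memory streams instead:
src = io.StringIO('id;verbatim\n1;tres bon produit\n')
dst = io.StringIO()
run(src, dst)
print(dst.getvalue())
```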

Note that none of this code has been run, so there could well be bugs in it, but at least I can see what the code is trying to do. Comprehensions and "functional"-style code are great, but they can easily become unreadable if you're not careful.