Question

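I'll try to be as clear as I can: I have 50,000 tweets that I want to do text mining on, and I'd like to improve my code. The data looks like this (sample_data).

I'm interested in lemmatizing the words that I have cleaned and tokenized (they are the values of the twToken key).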

sample_data = [{'twAuthor': 'Jean Lassalle',
                'twMedium': 'iPhone',
                'nFav': None,
                'nRT': '33',
                'isRT': True,
                'twText': ' RT @ColPeguyVauvil : @jeanlassalle "allez aux bouts de vos rêves" ',
                'twParty': 'Résistons!',
                'cleanText': ' rt colpeguyvauvil jeanlassalle allez aux bouts de vos rêves ',
                'twToken': ['colpeguyvauvil', 'jeanlassalle', 'allez', 'bouts', 'rêves']},
               {'twAuthor': 'Jean-Luc Mélenchon',
                'twMedium': 'Twitter Web Client',
                'nFav': '806',
                'nRT': '375',
                'isRT': False,
                'twText': ' (2/2) Ils préfèrent créer une nouvelle majorité cohérente plutôt que les alliances à géométrie variable opportunistes de leur direction. ',
                'twParty': 'La France Insoumise',
                'cleanText': ' 2 2 ils préfèrent créer une nouvelle majorité cohérente plutôt que les alliances à géométrie variable opportunistes de leur direction ',
                'twToken': ['2', '2', 'préfèrent', 'créer', 'nouvelle', 'majorité', 'cohérente', 'plutôt', 'alliances', 'géométrie', 'variable', 'opportunistes', 'direction']},
               {'twAuthor': 'Nathalie Arthaud',
                'twMedium': 'Android',
                'nFav': '37',
                'nRT': '24',
                'isRT': False,
                'twText': ' #10mai Commemoration fin de l esclavage. Reste à supprimer l esclavage salarial defendu par #Macron et Hollande ',
                'twParty': 'Lutte Ouvrière',
                'cleanText': ' 10mai commemoration fin de l esclavage reste à supprimer l esclavage salarial defendu par macron et hollande ',
                'twToken': ['10mai', 'commemoration', 'fin', 'esclavage', 'reste', 'supprimer', 'esclavage', 'salarial', 'defendu', 'macron', 'hollande']
               }]

However, there is no reliable French lemmatizer in Python, so I used some resources to build my own French lemmatizer dictionary. The dictionary looks like this:

sample_lemmas = [{"ortho":"rêves","lemme":"rêve","cgram":"NOM"},
                 {"ortho":"opportunistes","lemme":"opportuniste","cgram":"ADJ"},
                 {"ortho":"préfèrent","lemme":"préférer","cgram":"VER"},
                 {"ortho":"nouvelle","lemme":"nouveau","cgram":"ADJ"},
                 {"ortho":"allez","lemme":"aller","cgram":"VER"},
                 {"ortho":"défendu","lemme":"défendre","cgram":"VER"}]

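So ortho is the written form of a word (e.g., processed), lemme is the lemmatized form of the word (e.g., process), and cgram is the grammatical category of the word (e.g., VER for a verb).

What I want to do is create a twLemmas key for each tweet, holding the list of lemmas derived from its twToken list. So I loop over each tweet in sample_data, then over each token in twToken, and check whether the token exists in my lemma dictionary sample_lemmas. If it does, I retrieve the lemma from sample_lemmas and add it to the list that becomes the value of the twLemmas key. If it doesn't, I just add the word itself to the list.

My code is the following: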

list_of_ortho = []                      # List of words used to check whether a token exists in my lemmas dictionary
for wordDict in sample_lemmas:          # This loop feeds this list with each word
    list_of_ortho.append(wordDict["ortho"])

for elemList in sample_data:            # Here I iterate over each tweet in my data
    list_of_lemmas = []                 # This is the temporary list which will be the value of each twLemmas key
    for token in elemList["twToken"]:   # Here, I iterate over each token/word of a tweet
        for wordDict in sample_lemmas:
            if token == wordDict["ortho"]:
                list_of_lemmas.append(wordDict["lemme"])
        if token not in list_of_ortho:  # And this is to add a word to my list if it doesn't exist in my lemmas dictionary
            list_of_lemmas.append(token)
    elemList["lemmas"] = list_of_lemmas

sample_data

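The loop works fine, but it takes about 4 hours to complete. I know I'm neither a programmer nor a Python expert, and I know it will take some time no matter what. That is why I wanted to ask whether anyone has a better idea of how I could improve my code.

Thanks to anyone who takes the time to understand my code and help me. I hope I was clear enough (sorry, English is not my first language).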

Answer 1

Use a dictionary that maps each ortho to its lemme:

ortho_to_lemme = {word_dict["ortho"]: word_dict["lemme"] for word_dict in sample_lemmas}
for tweet in sample_data:
    tweet["twLemmas"] = [
        ortho_to_lemme.get(token, token) for token in tweet["twToken"]
    ]
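
To make the speed-up concrete: the loop in the question scans the whole of sample_lemmas for every token, so the work grows with (number of tokens) × (number of lemma entries), whereas a dict lookup is a hash lookup that takes constant time on average. Below is a self-contained sketch of the same idea on a trimmed copy of the question's samples. The helper names build_lookup and add_lemmas are hypothetical, and keeping the first lemme when an ortho appears more than once is an assumption of this sketch (the dict comprehension above keeps the last one, and the question's loop appends every match).

```python
# Trimmed copies of the question's sample data, for illustration only.
sample_lemmas = [
    {"ortho": "rêves", "lemme": "rêve", "cgram": "NOM"},
    {"ortho": "allez", "lemme": "aller", "cgram": "VER"},
]

sample_data = [
    {"twToken": ["colpeguyvauvil", "jeanlassalle", "allez", "bouts", "rêves"]},
]


def build_lookup(lemmas):
    """Map each written form to its lemma (hypothetical helper).

    Assumption: if the same ortho appears several times, the first entry wins.
    """
    lookup = {}
    for entry in lemmas:
        lookup.setdefault(entry["ortho"], entry["lemme"])
    return lookup


def add_lemmas(tweets, lookup):
    """Attach a twLemmas list to every tweet (hypothetical helper).

    Tokens missing from the lookup are kept unchanged.
    """
    for tweet in tweets:
        tweet["twLemmas"] = [lookup.get(token, token) for token in tweet["twToken"]]
    return tweets


ortho_to_lemme = build_lookup(sample_lemmas)
add_lemmas(sample_data, ortho_to_lemme)
print(sample_data[0]["twLemmas"])
# ['colpeguyvauvil', 'jeanlassalle', 'aller', 'bouts', 'rêve']
```

If the same ortho can map to different lemmes depending on cgram, a single dict cannot tell them apart without part-of-speech information, so which entry should win is something to decide for your own data.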
