Question

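Basically, I want to do two things in Python: 1) make the result a flat list of words rather than a list of lists; 2) filter out words that are only 1 character long.

I have to extract words from a list of dictionaries, convert them to lowercase, and then filter them so that only words longer than 1 character end up in the result list. I have to use map() and a list comprehension, but I don't know how to do that either. I am also required to use re.split() to split the text into words and get rid of unwanted punctuation.

So far I have been able to extract the relevant part of each dictionary, lowercase the text, and split it into words. But what I get is a list of lists whose elements are the words.

I want the result to be just one list of words that are 2 or more characters long.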

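import re


def extract_tweets(some_list):
    # Collect the lowercased text of each tweet.
    tweetlist = []
    for each_tweet in some_list:
        text = each_tweet['text']
        lowercase = text.lower()
        tweetlist.append(lowercase)
    # Split each text on runs of non-word characters.
    tweetwords = []
    for words in tweetlist:
        word = re.split(r'\W+', words)
        tweetwords.append(word)  # appends a whole list, hence the list of lists
    return(tweetwords)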

Answer 1

A simple list comprehension will help you:

tweetwords = [word for word in tweetwords if len(word) > 1]
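
Note that this comprehension filters words only if tweetwords is already a flat list of strings. With the list of lists returned by the function in the question, len(word) would measure each inner list rather than each word. A minimal sketch of a nested comprehension that flattens and filters in one pass (the sample data and the name flat_words are made up for illustration):

```python
# Hypothetical sample: a list of lists, shaped like the output of extract_tweets in the question.
tweetwords = [['hello', 'world', ''], ['a', 'sunny', 'day']]

# The outer loop walks the inner lists, the inner loop walks the words;
# only words with more than one character are kept.
flat_words = [word for words in tweetwords for word in words if len(word) > 1]

print(flat_words)  # ['hello', 'world', 'sunny', 'day']
```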

Answer 2

To work properly, your function extract_tweets needs a list of dictionaries as its argument. So some_list looks like this:

some_list = [
    {
        'text': "Hello world!"
    },
    {
        'text': "The sun is shinning, the sky is blue."
    },
]

In fact, the first loop extracts the texts, so it would be better to call the result text or text_list (rather than tweetlist). You get:

['hello world!', 'the sun is shinning, the sky is blue.']

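To extract the words from a text, it is better to use findall rather than split, because split can return empty strings when the text begins or ends with a non-word character, as in my example.

To find all the words in a text, you can use:

words = re.findall(r'\w+', text)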
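
The difference between the two calls can be seen on the first example text. This is only a small sketch; the variable names split_words and found_words are chosen here for illustration:

```python
import re

text = 'hello world!'

# split cuts at every run of non-word characters, so the trailing "!"
# leaves an empty string at the end of the result.
split_words = re.split(r'\W+', text)

# findall returns only the matched runs of word characters.
found_words = re.findall(r'\w+', text)

print(split_words)  # ['hello', 'world', '']
print(found_words)  # ['hello', 'world']
```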

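Note: the \w+ regex also captures digits and underscores. To avoid that, use the negated class [^\W\d_]+ instead.

The result of findall is a list of words. To keep only the words longer than 1 character, you can use filter with a predicate function, or a list comprehension: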
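
As a quick illustration of the two patterns, here is a sketch on a made-up sample string (the text and the variable names are assumptions, not part of the question):

```python
import re

text = '1, 2, 3, four snake_case'

# \w matches letters, digits, and the underscore.
with_digits = re.findall(r'\w+', text)

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which leaves letters only.
letters_only = re.findall(r'[^\W\d_]+', text)

print(with_digits)   # ['1', '2', '3', 'four', 'snake_case']
print(letters_only)  # ['four', 'snake', 'case']
```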

words = list(filter(lambda w: len(w) > 1, words))
# or:
words = [w for w in words if len(w) > 1]

Here is the refactored code:

import re
import pprint


def extract_tweets(some_list):
    texts = []
    for each_tweet in some_list:
        text = each_tweet['text']
        lowercase = text.lower()
        texts.append(lowercase)
    tweet_words = []
    for text in texts:
        words = re.findall(r'[^\W\d_]+', text)
        words = [w for w in words if len(w) > 1]
        tweet_words.append(words)
    return tweet_words

With the following example…

some_list = [
    {
        'text': "Hello world!"
    },
    {
        'text': "The sun is shinning, the sky is blue."
    },
    {
        'text': "1, 2, 3, four"
    },
    {
        'text': "not a word"
    },
]

pprint.pprint(extract_tweets(some_list))

…you get:

[['hello', 'world'],
 ['the', 'sun', 'is', 'shinning', 'the', 'sky', 'is', 'blue'],
 ['four'],
 ['not', 'word']]

Using extend instead of append, you get:

['hello',
 'world',
 'the',
 'sun',
 'is',
 'shinning',
 'the',
 'sky',
 'is',
 'blue',
 'four',
 'not',
 'word']
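
For reference, the extend variant might look like the sketch below. It also uses map() for the lowercasing step, since the question asks for it. The function name extract_words is hypothetical, and it uses findall as suggested in this answer rather than the re.split the assignment mentions:

```python
import re


def extract_words(some_list):
    # map() lowercases the text of each tweet lazily.
    texts = map(lambda tweet: tweet['text'].lower(), some_list)
    words = []
    for text in texts:
        # extend adds the words themselves, so the result stays one flat list.
        words.extend(w for w in re.findall(r'[^\W\d_]+', text) if len(w) > 1)
    return words


some_list = [
    {'text': "Hello world!"},
    {'text': "1, 2, 3, four"},
    {'text': "not a word"},
]
print(extract_words(some_list))  # ['hello', 'world', 'four', 'not', 'word']
```

Because extend takes any iterable, the generator expression filters the words without building an intermediate list for each tweet.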

How do I extract words from a list of lists and filter the words by length?
