Question

I'm trying to do some text cleaning in Python for sentiment analysis. Instead of lumping all the text together and then splitting it, I want to clean the text sentence by sentence. To do that, I use a for loop in my function:

[In] data = pd.read_csv('twitter_AC.csv')
[In] data.head()
0    We're #hiring! Click to apply: Vaccine Special...
1    Can you recommend anyone for this #job? Vaccin...
2    We're #hiring! Read about our latest #job open...
3    We're #hiring! Read about our latest #job open...
4    We're #hiring! Read about our latest #job open...
Name: text, dtype: object

[In]
def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a list of sentences
    '''
    for i in text:
        nopunc = [word for word in i if word not in string.punctuation]
        nopunc = ''.join(nopunc)
    return [nopunc.lower()]

[In] text_process(data)
[Out] ['were hiring read about our latest job opening here immunization rn httpstcopxczq5zrhr healthcare fairfax va careerarc']

The problem is that it returns only one sentence from my DataFrame.

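I can't figure out why the function doesn't output all the rows of my DataFrame. I also don't understand why the single row it returns is not the first one.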

Answer 1

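You are iterating over every element in the collection, but you overwrite the nopunc variable on each iteration.

So you only return the last row that was traversed.

Try this: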
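
A minimal sketch of the difference, using made-up sample rows (the `rows` list and the variable names are illustrative, not from the question): reassigning a variable inside a loop keeps only the final value, while appending to a list keeps every value.

```python
# Made-up sample data standing in for the rows of the text column.
rows = ["First row!", "Second row?", "Third row."]

# Overwriting pattern: `last` is reassigned on every pass,
# so only the value from the final iteration survives the loop.
for row in rows:
    last = row.lower()

# Accumulating pattern: each result is appended,
# so the list ends up with one entry per row.
kept = []
for row in rows:
    kept.append(row.lower())

print(last)  # third row.
print(kept)  # ['first row!', 'second row?', 'third row.']
```

This is also why the single row returned in the question is the last one rather than the first.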

import string

def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a list of sentences
    '''
    nopuncList = []  # collects one cleaned string per row
    for i in text:
        # keep only the characters that are not punctuation
        nopunc = [word for word in i if word not in string.punctuation]
        nopunc = ''.join(nopunc)
        nopuncList.append(nopunc.lower())
    return nopuncList

Answer 2

Your loop is a bit off. I'd suggest using something like a list comprehension:

import string

def text_process(text):
    # one cleaned, lowercased string per element of text
    return [''.join(word for word in i if word not in string.punctuation).lower()
            for i in text]
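
As a possible alternative, the character-by-character filter can be swapped for `str.translate` with a deletion table built by `str.maketrans`. This is only a sketch: the name `text_process_translate` and the sample strings are made up for illustration, and it assumes every element is a plain `str`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def text_process_translate(texts):
    """Hypothetical variant: strip punctuation and lowercase each string."""
    return [t.translate(_PUNCT_TABLE).lower() for t in texts]

# Made-up sample input, loosely modelled on the tweets in the question.
sample = ["We're #hiring! Click to apply",
          "Can you recommend anyone for this #job?"]
print(text_process_translate(sample))
# ['were hiring click to apply', 'can you recommend anyone for this job']
```

Like the list comprehension, it returns one cleaned string per input element.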

Answer 3

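You are overwriting the nopunc variable again and again, so your function returns only the last row. Try using an empty list and appending the result to it on each iteration.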

import string

def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a list of sentences
    '''
    result = []  # one cleaned string per row
    for i in text:
        nopunc = [word for word in i if word not in string.punctuation]
        nopunc = ''.join(nopunc)
        result.append(nopunc.lower())
    return result

Hope this helps.
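
The docstring in the question lists stopword removal as a step, but the function body only strips punctuation and lowercases. A rough sketch of how that step could be added is below. The helper name `clean_sentences`, the tiny `STOPWORDS` set, and the sample strings are all assumptions made for illustration; a real project would probably want a fuller stopword list.

```python
import string

# Assumption: a tiny illustrative stopword set, not a complete list.
# "were" is included because "We're" becomes "were" once punctuation is removed.
STOPWORDS = {"a", "about", "can", "for", "our", "this", "to", "were", "you"}

def clean_sentences(texts, stopwords=STOPWORDS):
    """Hypothetical helper: clean each string separately and return a list."""
    cleaned = []
    for text in texts:
        # Strip punctuation and lowercase, as in the answers.
        no_punct = ''.join(ch for ch in text if ch not in string.punctuation).lower()
        # Drop the words that appear in the stopword set.
        words = [w for w in no_punct.split() if w not in stopwords]
        cleaned.append(' '.join(words))
    return cleaned

# Made-up sample input, loosely modelled on the tweets in the question.
sample = ["We're #hiring! Read about our latest #job",
          "Can you recommend anyone for this #job?"]
print(clean_sentences(sample))
# ['hiring read latest job', 'recommend anyone job']
```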

Cleaning text sentence by sentence in a for loop with Python
