Question

I have a data frame like this:

product        complaint
Student Loan   words words words
Mortgage       words words words
Credit Card    words words words
Student Loan   words words words

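I am trying to preprocess the words in each complaint cell, but I want the preprocessing to depend on the product. This line of code applies my preprocessing function to every cell in the "complaint" column, and it works fine: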

df['complaint'] = df['complaint'].apply(lambda x: pre_process(x))

My preprocessing function basically tokenizes the text, removes stop words, and normalizes the complaint.

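I would like to go a step further and remove a custom stop-word list that depends on the product. Mortgage, Student Loan, and Credit Card each have a different stop-word list, and I only want to apply each list to the complaints it is relevant to. If possible, something along these lines: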

df['complaint'] = df['complaint'].apply(lambda x: pre_process(x, Student_stopwords) if df['product'] == "Student Loan")
df['complaint'] = df['complaint'].apply(lambda x: pre_process(x, mortgage_stopwords) if df['product'] == "Mortgage")
df['complaint'] = df['complaint'].apply(lambda x: pre_process(x, creditcard_stopwords) if df['product'] == "Credit Card")

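I know this is probably very inefficient, but it is how I picture the solution. The part I cannot work out is how to apply my preprocessing function only to specific cells.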

Any help would be greatly appreciated.

Answer 1

You can define a separate function and then use apply. Something like this:

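def which_preproc(row):
    # Choose the stop-word list that matches this row's product
    if row['product'] == 'Student Loan':
        return pre_process(row['complaint'], Student_stopwords)
    # similarly for others and other preprocessing you want
    return row['complaint']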

Then use apply:

df['complaint'] = df.apply(which_preproc, axis=1)
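
One way to avoid a long if/elif chain is to keep the stop-word lists in a dictionary keyed by product. The following is only a minimal sketch: the `pre_process` shown here is a simplified stand-in for the asker's function, and `STOPWORDS_BY_PRODUCT` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical stop-word lists; the real ones come from the asker's project.
STOPWORDS_BY_PRODUCT = {
    "Student Loan": {"loan", "student"},
    "Mortgage": {"mortgage", "house"},
    "Credit Card": {"card", "credit"},
}

def pre_process(text, stopwords=frozenset()):
    """Stand-in for the asker's function: lowercase, split, drop stop words."""
    return " ".join(w for w in text.lower().split() if w not in stopwords)

def which_preproc(row):
    # Products without an entry fall back to an empty stop-word set.
    stopwords = STOPWORDS_BY_PRODUCT.get(row["product"], frozenset())
    return pre_process(row["complaint"], stopwords)

df = pd.DataFrame({
    "product": ["Student Loan", "Mortgage", "Credit Card"],
    "complaint": [
        "My student loan servicer lied",
        "The mortgage payment doubled",
        "Credit card fee charged twice",
    ],
})

# axis=1 passes each row to which_preproc as a Series.
df["complaint"] = df.apply(which_preproc, axis=1)
```

With the dictionary in place, supporting a new product only requires a new entry rather than a new branch.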

Answer 2

Try this:

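def pre_process_wrapper(x):
    # x is one row of the data frame
    complaint = x['complaint']
    product = x['product']
    if product == "Mortgage":
        complaint = pre_process(complaint, mortgage_stopwords)
    elif product == "Student Loan":
        complaint = pre_process(complaint, Student_stopwords)
    elif product == "Credit Card":
        complaint = pre_process(complaint, creditcard_stopwords)

    return complaint

df['complaint'] = df.apply(pre_process_wrapper, axis=1)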

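I wrote an extra wrapper function that calls your pre_process function and returns the preprocessed complaint. You can then pass it to the DataFrame-level apply function, which with axis=1 hands the wrapper one row at a time.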
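
If you would rather touch only the rows for one product at a time, which is closer to what the question describes, a boolean mask with `.loc` is another option. This is a sketch under assumptions: `pre_process` is a simplified stand-in and `stopword_lists` is a hypothetical mapping, not something from the original post.

```python
import pandas as pd

def pre_process(text, stopwords):
    """Stand-in for the asker's function: lowercase, split, drop stop words."""
    return " ".join(w for w in text.lower().split() if w not in stopwords)

# Hypothetical stop-word lists keyed by product name.
stopword_lists = {
    "Student Loan": {"loan"},
    "Mortgage": {"mortgage"},
}

df = pd.DataFrame({
    "product": ["Student Loan", "Mortgage", "Credit Card"],
    "complaint": ["Loan was sold", "Mortgage was denied", "Card was stolen"],
})

for product, stopwords in stopword_lists.items():
    # Boolean Series that is True only for rows of this product.
    mask = df["product"] == product
    # Rewrite just the selected cells; all other rows stay untouched.
    df.loc[mask, "complaint"] = df.loc[mask, "complaint"].apply(
        lambda text: pre_process(text, stopwords)
    )
```

Rows whose product has no entry in the mapping (here "Credit Card") keep their original text.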

Answer 3

Try this code:

df['complaint'] = df.apply(lambda row: pre_process(row['complaint'], row['product']), axis=1)
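
This answer passes the product itself as the second argument, so it presumably relies on a `pre_process` that looks up the stop words from the product name. The sketch below shows what such a variant might look like. The two-argument signature, the `STOPWORDS` table, and its contents are all assumptions for illustration. It also iterates over the two columns with `zip` instead of a row-wise apply, which avoids building a Series for every row.

```python
import pandas as pd

# Hypothetical lookup table; names and contents are illustrative only.
STOPWORDS = {"Mortgage": {"mortgage"}, "Credit Card": {"card"}}

def pre_process(text, product):
    """Assumed two-argument variant: looks up stop words by product name."""
    stopwords = STOPWORDS.get(product, set())
    return " ".join(w for w in text.lower().split() if w not in stopwords)

df = pd.DataFrame({
    "product": ["Mortgage", "Credit Card"],
    "complaint": ["Mortgage rate changed", "Card was declined"],
})

# Walk the complaint and product columns in parallel.
df["complaint"] = [
    pre_process(text, product)
    for text, product in zip(df["complaint"], df["product"])
]
```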

Apply a lambda to a pandas DataFrame based on a condition in another column
