Question

I have a training dataset in which one column holds a list of words. An example is shown below:

    target   id     values
0    eng     123    ['hi', 'hello','bye']
1    eng     124    ['my', 'name', 'is']

Now I have a clean(text) function that I want to apply to the values column. I tried the following:

import pandas as pd

train = pd.read_json('./file.json')
train['values'] = train['values'].apply(clean)

This raises an error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I understand that I am calling .apply on arrays of strings, which apparently is not allowed, but I am not sure how to fix it.

Please advise.

Edit: adding the clean(text) function

def clean(text):
    import re
    from string import punctuation
    from nltk.stem import SnowballStemmer
    from nltk.corpus import stopwords

    def pad_str(s):
        return ' '+s+' '

    if pd.isnull(text):
        return ''


    # Empty question

    if type(text) != str or text=='':
        return ''

    # Clean the text
    text = re.sub("\'s", " ", text) 
    text = re.sub(" whats ", " what is ", text, flags=re.IGNORECASE)
    #many other regular expression operations



    # replace non-ascii word with special word    
    text = re.sub('[^\x00-\x7F]+', pad_str(SPECIAL_TOKENS['non-ascii']), text) 
    return text

Answer 1

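The problem is in how your clean function is being called. The function operates on a single string, not a list of strings, but each cell of the values column is a list, so apply passes a whole list to it. Inside clean, pd.isnull(text) then returns one boolean per list element, and using that array as an if condition raises the ValueError. Apply clean to each element of the list instead: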

train['values'] = train['values'].apply(lambda x: [clean(s) for s in x])
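
To see both the cause and the fix in one place, here is a minimal sketch. It assumes a simplified stand-in for clean (only two of your substitutions, with the SPECIAL_TOKENS step left out because that dictionary is not shown) and builds the example frame by hand instead of reading file.json:

```python
import re

import pandas as pd


def clean(text):
    # Simplified, hypothetical stand-in for the clean() in the question:
    # the real function has more substitutions and uses SPECIAL_TOKENS.
    if not isinstance(text, str) or text == '':
        return ''
    text = re.sub(r"'s", " ", text)
    text = re.sub(" whats ", " what is ", text, flags=re.IGNORECASE)
    return text


# Hand-built copy of the example data from the question.
train = pd.DataFrame({
    'target': ['eng', 'eng'],
    'id': [123, 124],
    'values': [['hi', 'hello', 'bye'], ['my', 'name', 'is']],
})

# Cause: pd.isnull on a list returns one boolean per element,
# and an array with several elements has no single truth value.
mask = pd.isnull(['hi', 'hello', 'bye'])
error_message = None
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)

# Fix: call clean() on every word of each list, not on the list itself.
train['values'] = train['values'].apply(
    lambda words: [clean(word) for word in words]
)
```

The list comprehension keeps each cell as a list, so the column has the same shape afterwards and only the individual words change.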
