Question

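I am preprocessing data that includes a "job description" column in text format. I created a data frame and tried to apply a function to preprocess the data, but when I apply the function to the column I get the error "expected string or bytes-like object". Please see my code below and help.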

####################################################    
#Function to pre process the data    
def clean_text(text):
        """
        Applies some pre-processing on the given text.

        Steps :
        - Removing HTML tags
        - Removing punctuation
        - Lowering text
        """

        # remove HTML tags
        text = re.sub(r'<.*?>', '', text)

        # remove the characters [\], ['] and ["]
        text = re.sub(r"\\", "", text)    
        text = re.sub(r"\'", "", text)    
        text = re.sub(r"\"", "", text)    

        # convert text to lowercase
        text = text.strip().lower()

        # replace every non-letter character (digits, punctuation) with a space
        text = re.sub("[^a-zA-Z]",  # Search for all non-letters
                              " ",          # Replace all non-letters with spaces
                              str(text))

        # replace punctuation characters with spaces
        filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
        translate_dict = dict((c, " ") for c in filters)
        translate_map = str.maketrans(translate_dict)
        text = text.translate(translate_map)

        return text
#############################################################     
# Apply the clean_text function to the job_description column of the data frame
df['jobnew']=df['job_description'].apply(clean_text)
        ---------------------------------------------------------------------------
        TypeError                                 Traceback (most recent call last)
        <ipython-input-33-c15402ac31ba> in <module>()
        ----> 1 df['jobnew']=df['job_description'].apply(clean_text)

        ~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
           3192             else:
           3193                 values = self.astype(object).values
        -> 3194                 mapped = lib.map_infer(values, f, convert=convert_dtype)
           3195 
           3196         if len(mapped) and isinstance(mapped[0], Series):

        pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

        <ipython-input-30-5f24dbf9d559> in clean_text(text)
             10 
             11     # remove HTML tags
        ---> 12     text = re.sub(r'<.*?>', '', text)
             13 
             14     # remove the characters [\], ['] and ["]

        ~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
            190     a callable, it's passed the Match object and must return
            191     a replacement string to be used."""
        --> 192     return _compile(pattern, flags).sub(repl, string, count)
            193 
            194 def subn(pattern, repl, string, count=0, flags=0):

        TypeError: expected string or bytes-like object

Answer 1

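The function re.sub is telling you that you called it with something (the argument text) that is not a string. Since it is invoked by calling apply on the contents of df['job_description'], the problem must lie in how this data frame was created... and you have not shown that part of your code.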

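Construct your data frame so that this column contains only strings, and your program will run without error for at least a few more lines.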
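
If the data frame cannot be rebuilt, a possible workaround is to coerce the column to strings before applying the function. This is a minimal sketch under assumptions: the sample data is hypothetical, strip_tags is a shortened stand-in for clean_text, and treating missing values as empty strings may or may not be what you want.

```python
import re
import pandas as pd

def strip_tags(text):
    """Shortened stand-in for clean_text: drop HTML tags, trim, lowercase."""
    return re.sub(r"<.*?>", "", text).strip().lower()

# Hypothetical sample data with a missing value and a number.
df = pd.DataFrame({"job_description": ["<p>Data Analyst</p>", None, 42]})

# Replace missing values with an empty string, then cast every cell to str.
as_text = df["job_description"].fillna("").astype(str)

# Every cell is now a string, so re.sub no longer raises a TypeError.
df["jobnew"] = as_text.apply(strip_tags)

print(df["jobnew"].tolist())
```

An alternative is to drop the incomplete rows first with df.dropna(subset=['job_description']) if rows without a description are of no use.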
