re.sub error: "expected string or bytes-like object"

Time: 2017-05-01 22:47:20

Tags: python regex pandas nltk

I have read several posts about this error, but I still can't figure it out. It happens when I try to call my function in a loop:

def fix_Plan(location):
    letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          location)     # Column and row to search    

    words = letters_only.lower().split()     
    stops = set(stopwords.words("english"))      
    meaningful_words = [w for w in words if not w in stops]      
    return (" ".join(meaningful_words))    

col_Plan = fix_Plan(train["Plan"][0])    
num_responses = train["Plan"].size    
clean_Plan_responses = []

for i in range(0,num_responses):
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))

This is the error:

Traceback (most recent call last):
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 48, in <module>
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 22, in fix_Plan
    location)  # Column and row to search
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

4 answers:

Answer 0 (score: 37)

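As you said in the comments, some of the values appear to be floats, not strings. You need to convert them to strings before passing them to re.sub. The simplest way is to change location to str(location) in the re.sub call. It does no harm even when the value is already a str.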

letters_only = re.sub("[^a-zA-Z]",    # Search for all non-letters
                      " ",            # Replace all non-letters with spaces
                      str(location))  # Convert to str so floats do not raise
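
To illustrate the point above, here is a minimal sketch using only the standard library. It assumes the offending value is a float NaN, which is how pandas usually represents a missing cell; the helper name `clean` is hypothetical and not part of the original code. Note that str() turns NaN into the text "nan", so the sketch handles missing values separately instead of letting that word slip into the cleaned output.

```python
import math
import re

def clean(value):
    """Replace non-letters with spaces; treat a missing value as empty text."""
    # Assumption: a missing cell arrives as a float NaN.
    if isinstance(value, float) and math.isnan(value):
        return ""
    # str() leaves strings unchanged and makes other types safe for re.sub.
    return re.sub("[^a-zA-Z]", " ", str(value))

# Passing a float directly reproduces the TypeError from the question.
try:
    re.sub("[^a-zA-Z]", " ", float("nan"))
    raised = False
except TypeError:
    raised = True

# With str() alone the call succeeds, but NaN becomes the literal word "nan".
plain = re.sub("[^a-zA-Z]", " ", str(float("nan")))
```

Whether a missing value should become an empty string or be dropped entirely depends on what the cleaned text is used for.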

Answer 1 (score: 0)

I think a better approach is to use the re.match() function. Here is an example that may help you.

import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data used by word_tokenize
sentences = word_tokenize("I love to learn NLP \n 'a :(")
# Keep only the tokens that start with a letter, lowercased
sentences = [word.lower() for word in sentences if re.match('^[a-zA-Z]+', word)]
sentences

Answer 2 (score: 0)

The simplest solution is to apply Python's str function to the column you are iterating over.

If you are using pandas, this can be done as:

dataframe['column_name'] = dataframe['column_name'].apply(str)
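
As a hedged sketch of how this behaves on a column with mixed values, the example below builds a small hypothetical DataFrame (the data is invented for illustration; only the column name "Plan" comes from the question). It shows that apply(str) converts a missing value to the text "nan", and that filling missing values first is one way to avoid that.

```python
import pandas as pd

# Hypothetical data: one real string, one missing value (NaN), one number.
train = pd.DataFrame({"Plan": ["Plan A", float("nan"), 3.5]})

# apply(str) makes every value a string, but NaN becomes the text "nan".
as_str = train["Plan"].apply(str).tolist()

# Filling missing values first keeps the word "nan" out of the text.
filled = train["Plan"].fillna("").astype(str).tolist()
```

Either version is safe to pass to re.sub; the second simply avoids treating a missing response as the word "nan".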

Answer 3 (score: 0)

I had the same problem. Interestingly, nothing I tried fixed it until I realized there were two special characters in the string.

For example, in my case the text contained these two characters:

&lrm; (Left-to-Right Mark)
&zwnj; (Zero-width non-joiner)

My solution was to remove these two characters, and that solved the problem.


    import re
    mystring = "&lrm;Some Time W&zwnj;e"
    # Remove the literal entity text for each of the two characters
    mystring = re.sub(r"&lrm;", "", mystring)
    mystring = re.sub(r"&zwnj;", "", mystring)
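
The two re.sub calls above remove the entity text `&lrm;` and `&zwnj;`. As a rough sketch, the same cleanup can be written as a single pattern that also covers the case where the string holds the actual invisible code points (U+200E and U+200C) rather than their entity names. That second case is an assumption, not something stated in the answer, and the names `INVISIBLE` and `strip_invisible` are hypothetical.

```python
import re

# Assumption: the text may contain either the HTML entity text or the
# actual invisible code points U+200E (LRM) and U+200C (ZWNJ).
INVISIBLE = re.compile("&lrm;|&zwnj;|[\u200e\u200c]")

def strip_invisible(text):
    """Remove both spellings of the two invisible characters."""
    return INVISIBLE.sub("", text)

entity_form = strip_invisible("&lrm;Some Time W&zwnj;e")
char_form = strip_invisible("\u200eSome Time W\u200ce")
```

Both calls yield the same visible text, so the function can be applied before any further cleaning regardless of how the characters were encoded.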

I hope this helps someone who runs into the same problem.