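TypeError: sequence item 1: expected a bytes-like object, str found

Asked: 2016-10-02 15:15:00

Tags: python regex python-3.x

I am trying to extract the English titles from a wiki titles dump stored in a text file, using regular expressions in Python 3. The dump also contains titles in other languages and some symbols. Here is my code: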

import re

with open('/Users/some/directory/title.txt', 'rb') as f:
    text = f.read()
    letters_only = re.sub(b"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
print(words)

But I get an error:

TypeError: sequence item 1: expected a bytes-like object, str found 

on the line: letters_only = re.sub(b"[^a-zA-Z]", " ", text)

However, I am using b'' to make the output a bytes type. Below is a sample of the text file:

Destroy-Oh-Boy!!
!!Que_Corra_La_Voz!!
!!_(chess)
!!_(disambiguation)
!'O!Kung
!'O!Kung_language
!'O-!khung_language
!337$P34K
!=
!?
!?!
!?Revolution!?
!?_(chess)
!A_Luchar!
!Action_Pact!
!Action_pact!
!Adios_Amigos!
!Alabadle!
!Alarma!
!Alarma!_(album)
!Alarma!_(disambiguation)
!Alarma!_(magazine)
!Alarma!_Records
!Alarma!_magazine
!Alfaro_Vive,_Carajo!
!All-Time_Quarterback!
!All-Time_Quarterback!_(EP)
!All-Time_Quarterback!_(album)
!Alla_tu!
!Amigos!
!Amigos!_(Arrested_Development_episode)
!Arriba!_La_Pachanga
!Ask_a_Mexican!
!Atame!
!Ay,_Carmela!_(film)
!Ay,_caramba!
!BANG!
!Bang!
!Bang!_TV
!Basta_Ya!
!Bastardos!
!Bastardos!_(album)
!Bastardos_en_Vivo!
!Bienvenido,_Mr._Marshall!
!Ciauetistico!
!Ciautistico!
!DOCTYPE
!Dame!_!Dame!_!Dame!
!Decapitacion!
!Dos!
!Explora!_Science_Center_and_Children's_Museum
!F
!Forward,_Russia!
!Forward_Russia!
!Ga!ne_language
!Ga!nge_language
!Gã!ne
!Gã!ne_language
!Gã!nge_language
!HERO
!Happy_Birthday_Guadaloupe!
!Happy_Birthday_Guadalupe!
!Hello_Friends

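I searched online but had no success. Any help would be appreciated.

4 answers:

Answer 0 (score: 6)

The problem is the repl argument you are supplying; it is not a bytes object: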

letters_only = re.sub(b"[^a-zA-Z]", " ", b'Hello2World')
# TypeError: sequence item 1: expected a bytes-like object, str found

Instead, supply repl as a bytes instance, b" ":

letters_only = re.sub(b"[^a-zA-Z]", b" ", b'Hello2World')
print(letters_only) 
b'Hello World'

Note: if you are not working with byte sequences, don't prefix your literals with b and don't open the file with rb.
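A minimal sketch of the text-mode route this note points toward. It assumes the dump is UTF-8 encoded (the question does not say), and it writes a small temporary stand-in file instead of using the path from the question:

```python
import os
import re
import tempfile

# Write a tiny stand-in file; the real dump from the question is not used here.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("!Alarma!_(album)\n!Gã!ne_language\n")
    path = tmp.name

# Text mode: f.read() returns str, so the pattern and replacement are plain str literals.
with open(path, "r", encoding="utf-8") as f:  # UTF-8 is an assumption
    text = f.read()
os.remove(path)

letters_only = re.sub("[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
```

Note that [^a-zA-Z] also replaces accented letters such as ã, so a title like "Gã!ne" ends up split into "g" and "ne".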

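Answer 1 (score: 4)

You have to choose between binary and text mode.

Either you open the file as rb, and then you can use re.sub(b"[^a-zA-Z]", b" ", text) (text is a bytes object),

or you open the file as r, and then you can use re.sub("[^a-zA-Z]", " ", text) (text is a str object).

The second solution is more "classic".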
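The two options can be put side by side in a short sketch. The helper name `letters_only` is hypothetical (it is not part of the re module); it only picks the pattern and replacement whose type matches the input:

```python
import re

def letters_only(data):
    """Replace non-ASCII-letter characters with spaces, for bytes or str input."""
    if isinstance(data, bytes):
        # Binary mode: pattern, replacement and input are all bytes.
        return re.sub(b"[^a-zA-Z]", b" ", data)
    # Text mode: pattern, replacement and input are all str.
    return re.sub("[^a-zA-Z]", " ", data)

print(letters_only(b"!Bang!_TV").split())
print(letters_only("!Bang!_TV").split())
```

The result has the same type as the input: bytes in, bytes out; str in, str out.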

Answer 2 (score: 2)

You can't use a bytes string for your regular expression match when the replacement string isn't one. Essentially, you can't mix different object types (bytes and str) when performing most tasks. In the code above, you are using a binary search string and binary text, but the replacement string is a regular str. All arguments must be of the same type, so there are two possible solutions.
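The type mismatch can be reproduced in isolation. This is a small sketch using the same pattern as the question on a made-up input; the variable names are illustrative only:

```python
import re

data = b"Hello2World"

try:
    # bytes pattern + str replacement: the argument types do not match.
    re.sub(b"[^a-zA-Z]", " ", data)
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("TypeError:", exc)

# bytes pattern + bytes replacement + bytes input: consistent types.
fixed = re.sub(b"[^a-zA-Z]", b" ", data)
print(fixed)
```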

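Taking the above into account, your code could look like this (it will return regular str strings, not bytes objects):

with open('/Users/some/directory/title.txt', 'r') as f:
    text = f.read()
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
print(words)

Note that the code does use a special kind of string for the regular expression: a raw string, prefixed with r. This means Python won't interpret backslash escape sequences such as \n, which is very useful for regular expressions. For more details about raw strings, see the docs.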
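The effect of the r prefix can be checked directly. This is an illustrative sketch with a made-up input string, not part of the original answer:

```python
import re

# A normal literal turns \n into one newline character; a raw literal keeps both characters.
print(len("\n"), len(r"\n"))

# In a regex, \b means "word boundary", but in a normal literal "\b" is a backspace character.
raw_hits = re.findall(r"\bcat\b", "cat concat cat")
plain_hits = re.findall("\bcat\b", "cat concat cat")
print(raw_hits, plain_hits)
```

For a pattern without backslashes, such as [^a-zA-Z], the raw and normal literals are identical, so the prefix is a habit here rather than a requirement.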

Answer 3 (score: 0)

You can also use br'…', which is the bytes analog of r'…'. The replacement must also be a bytes string.
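A short sketch of the raw bytes form, using one title from the question's sample as input. Decoding the words as ASCII afterwards is an extra step added here for illustration; it is safe because only ASCII letters survive the substitution:

```python
import re

data = b"!Alarma!_(album)"

# br"..." is a raw bytes literal; the replacement b" " is bytes as well.
cleaned = re.sub(br"[^a-zA-Z]", b" ", data)
words = [w.decode("ascii") for w in cleaned.lower().split()]
print(words)
```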