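I am trying to extract English titles from a wiki title dump stored in a text file, using regular expressions in Python 3. The wiki dump also contains titles in other languages and some symbols. Here is my code: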
import re

with open('/Users/some/directory/title.txt', 'rb') as f:
    text = f.read()
letters_only = re.sub(b"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
But I get an error:
TypeError: sequence item 1: expected a bytes-like object, str found
on the line: letters_only = re.sub(b"[^a-zA-Z]", " ", text)
However, I am using b'' to make the output a bytes type. Below is a sample of the text file:
Destroy-Oh-Boy!!
!!Que_Corra_La_Voz!!
!!_(chess)
!!_(disambiguation)
!'O!Kung
!'O!Kung_language
!'O-!khung_language
!337$P34K
!=
!?
!?!
!?Revolution!?
!?_(chess)
!A_Luchar!
!Action_Pact!
!Action_pact!
!Adios_Amigos!
!Alabadle!
!Alarma!
!Alarma!_(album)
!Alarma!_(disambiguation)
!Alarma!_(magazine)
!Alarma!_Records
!Alarma!_magazine
!Alfaro_Vive,_Carajo!
!All-Time_Quarterback!
!All-Time_Quarterback!_(EP)
!All-Time_Quarterback!_(album)
!Alla_tu!
!Amigos!
!Amigos!_(Arrested_Development_episode)
!Arriba!_La_Pachanga
!Ask_a_Mexican!
!Atame!
!Ay,_Carmela!_(film)
!Ay,_caramba!
!BANG!
!Bang!
!Bang!_TV
!Basta_Ya!
!Bastardos!
!Bastardos!_(album)
!Bastardos_en_Vivo!
!Bienvenido,_Mr._Marshall!
!Ciauetistico!
!Ciautistico!
!DOCTYPE
!Dame!_!Dame!_!Dame!
!Decapitacion!
!Dos!
!Explora!_Science_Center_and_Children's_Museum
!F
!Forward,_Russia!
!Forward_Russia!
!Ga!ne_language
!Ga!nge_language
!Gã!ne
!Gã!ne_language
!Gã!nge_language
!HERO
!Happy_Birthday_Guadaloupe!
!Happy_Birthday_Guadalupe!
!Hello_Friends
I searched online but had no success. Any help would be appreciated.
Answer 0 (score: 6)
The problem is the repl argument you supplied: it is not a bytes object:
letters_only = re.sub(b"[^a-zA-Z]", " ", b'Hello2World')
# TypeError: sequence item 1: expected a bytes-like object, str found
Instead, supply repl as a bytes instance, b" ":
letters_only = re.sub(b"[^a-zA-Z]", b" ", b'Hello2World')
print(letters_only)
b'Hello World'
Note: if you are not looking for byte sequences, don't prefix your literals with b and don't open the file with rb.
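As a minimal sketch of the bytes-only variant, here is the fix applied to a few titles copied from the sample in the question. The variable names sample and text_words are made up for illustration, and the titles are held in memory rather than read from title.txt.

```python
import re

# A few titles copied from the sample in the question, held in memory as bytes.
sample = b"!Action_Pact!\n!Bang!_TV\n!337$P34K\n"

# Pattern, replacement, and input are all bytes, so re.sub accepts them.
letters_only = re.sub(b"[^a-zA-Z]", b" ", sample)
words = letters_only.lower().split()
print(words)

# Decode if str values are needed later; ASCII is enough here because
# only the letters a-z survive the substitution.
text_words = [w.decode("ascii") for w in words]
print(text_words)
```

The result of the bytes pipeline is a list of bytes objects, which is why the note above suggests text mode unless byte sequences are really what you want.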
Answer 1 (score: 4)
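You have to choose between binary and text mode.
Either you open the file as rb, and then you can use re.sub(b"[^a-zA-Z]", b" ", text) (text is a bytes object),
or you open the file as r, and then you can use re.sub("[^a-zA-Z]", " ", text) (text is a str object).
The second solution is more "classic".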
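To compare the two modes side by side, here is a rough sketch that writes a tiny stand-in for title.txt to a temporary file and reads it back both ways. The UTF-8 encoding is an assumption about the real dump, and the temporary file and variable names are purely illustrative.

```python
import os
import re
import tempfile

# Write a tiny stand-in for title.txt so the example is self-contained.
# UTF-8 is an assumption about how the real dump is encoded.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("!Alarma!_(album)\n!Gã!ne_language\n")
    path = tmp.name

# Binary mode: the file content, the pattern, and the replacement are bytes.
with open(path, "rb") as f:
    raw = f.read()
bytes_words = re.sub(b"[^a-zA-Z]", b" ", raw).lower().split()

# Text mode: the file content, the pattern, and the replacement are str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
str_words = re.sub("[^a-zA-Z]", " ", text).lower().split()

os.remove(path)
print(bytes_words)
print(str_words)
```

Both runs find the same words. The only difference is the element type: bytes in the first list, str in the second.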
Answer 2 (score: 2)
You cannot use a bytes string for the regular expression match when the replacement string is not a bytes string as well.
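Essentially, you cannot mix different object types (bytes and str) when performing most tasks. In the code above, you are using a binary search pattern and binary text, but the replacement string is a regular str. All arguments must be of the same type, so there are two possible solutions.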
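A small sketch of what "same type" means in practice: both mixed calls below raise TypeError, and explicit decode/encode calls bring the arguments back in line. The failures list and the other variable names are only there for illustration.

```python
import re

data = b"Hello2World"
failures = []

# bytes pattern + str replacement: the situation from the question.
try:
    re.sub(b"[^a-zA-Z]", " ", data)
except TypeError as exc:
    failures.append("replacement")
    print("mixed replacement:", exc)

# str pattern + bytes input fails as well.
try:
    re.sub("[^a-zA-Z]", " ", data)
except TypeError as exc:
    failures.append("pattern")
    print("mixed pattern:", exc)

# Convert explicitly so that all three arguments share one type.
as_text = re.sub("[^a-zA-Z]", " ", data.decode("ascii"))
as_bytes = re.sub(b"[^a-zA-Z]", " ".encode("ascii"), data)
print(as_text, as_bytes)
```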
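With the above in mind, your code could look like this (it returns regular str strings, not bytes objects):
import re

with open('/Users/some/directory/title.txt', 'r') as f:
    text = f.read()
letters_only = re.sub(r"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
Note that the code does use a special type of string for the regular expression: a raw string, prefixed with r. This means that Python will not interpret backslash escape sequences such as \n, which is very useful for regular expressions. For details on raw strings, see the docs.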
Answer 3 (score: 0)
You can also use br'…', which is the bytes analogue of r'…'. The replacement must also be a bytes string.
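A brief sketch of the br prefix in use. The pattern here includes \s so that there is a backslash for the raw prefix to protect; that pattern and the sample data are illustrative and not from the question.

```python
import re

# br"..." is a raw bytes literal: no escape processing, and the value is bytes.
pattern = br"[^a-zA-Z\s]+"
equivalent = pattern == b"[^a-zA-Z\\s]+"

data = b"!Bang!_TV\n!Dos!\n"
# The replacement is a bytes literal too, matching the pattern and the input.
cleaned = re.sub(pattern, b" ", data)
print(cleaned)
print(cleaned.split())
```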