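Python, formatting re.findall() output

Time: 2013-11-11 09:33:01

Tags: python regex python-3.x

I am trying to get to grips with regular expressions in Python. I am writing a very simple script to scrape the email addresses from a given URL.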

import re
from urllib.request import *


url = input("Please insert the URL you wish to scrape> ")

page = urlopen(url)

content = page.read()

email_string = b'[a-z0-9_. A-Z]*@[a-z0-9_. A-Z]*.[a-zA-Z]'

emails_in_page = re.findall(email_string, content)

print("Here are the emails found: ")

for email in emails_in_page:
    print(email)

re.findall() returns a list, and when the program prints the emails, the "b" from the regular expression string is included in the output, like this:

b'email1@email.com'
b'email2@email.com'
...

How can I print a clean list of emails? (i.e.: email1@email.com)

1 Answer:

Answer 0: (score: 2)

You are printing bytes objects. Decode them to strings:

encoding = page.headers.get_param('charset')
if encoding is None:
    encoding = 'utf8'  # sensible default

for email in emails_in_page:
    print(email.decode(encoding))
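
To see the effect without fetching a real page, here is a minimal sketch of the same idea. The sample bytes, the variable names, and the simplified pattern are assumptions made for illustration; they are not taken from the question, and the pattern is not a complete email validator.

```python
import re

# Hypothetical page body standing in for page.read(); no network access needed.
content = b"Contact: email1@email.com or email2@email.com"

# Simplified bytes pattern for illustration only (escaped dot, no spaces allowed).
email_pattern = rb"[A-Za-z0-9_.]+@[A-Za-z0-9_.]+\.[A-Za-z]+"

raw_matches = re.findall(email_pattern, content)   # list of bytes objects
emails = [m.decode("utf8") for m in raw_matches]   # list of str objects

for email in emails:
    print(email)   # printed without the b'...' wrapper
```

Printing a bytes object shows its repr, which is where the b'...' comes from; once decoded, each match prints as plain text.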

Or decode the HTML page you retrieved:

encoding = page.headers.get_param('charset')
if encoding is None:
    encoding = 'utf8'  # sensible default

content = page.read().decode(encoding)

and use a unicode string regular expression:

email_string = '[a-z0-9_. A-Z]*@[a-z0-9_. A-Z]*.[a-zA-Z]'
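
The decode-first approach might look like the sketch below. The sample body and the fixed encoding are stand-ins for page.read() and the charset header. The tightened pattern is my own assumption rather than the one above: it escapes the dot and drops the space from the character classes, since an unescaped "." matches any character.

```python
import re

# Hypothetical raw response body and charset; in a real script these would come
# from page.read() and page.headers.get_param('charset').
raw_body = "Écrivez à email1@email.com\n".encode("utf8")
encoding = "utf8"

text = raw_body.decode(encoding)   # bytes -> str, once, up front

# Tightened str pattern (an assumption, not the original pattern).
email_pattern = r"[A-Za-z0-9_.]+@[A-Za-z0-9_.]+\.[A-Za-z]+"

emails = re.findall(email_pattern, text)   # matches are already str
print(emails)
```

Pattern and input must be the same type: applying a str pattern to the undecoded bytes raises TypeError, which is why the pattern loses its b prefix once the page is decoded.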

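Many web pages do not send a correct charset parameter in the Content-Type header, or set it incorrectly, so even the "sensible default" can be wrong from time to time.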
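
One way to keep a script from crashing on a missing or wrong charset is to decode defensively. The helper below is a hypothetical sketch, not part of the original answer: decode_body and its parameters are names invented here. It only tolerates bad input and does not detect the real encoding.

```python
def decode_body(raw: bytes, declared_charset=None, default="utf8") -> str:
    """Decode raw bytes, tolerating a missing, unknown, or wrong charset."""
    for candidate in (declared_charset, default):
        if candidate is None:
            continue
        try:
            return raw.decode(candidate)
        except (LookupError, UnicodeDecodeError):
            # LookupError: unknown codec name.
            # UnicodeDecodeError: the bytes are not valid in that codec.
            continue
    # Last resort: never raises; undecodable bytes become U+FFFD.
    return raw.decode(default, errors="replace")


# Usage with a body that is not valid UTF-8 despite the declared charset.
print(decode_body(b"caf\xe9 a@b.com", "utf8"))
```

ASCII-only email addresses survive this fallback intact, because the replacement character only stands in for the bytes that could not be decoded.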

An HTML parsing library such as BeautifulSoup does a better job of codec detection; it includes additional heuristics to make an educated guess:

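from bs4 import BeautifulSoup

# Name the parser explicitly; from_encoding is tried first, before BeautifulSoup's own detection
soup = BeautifulSoup(page.read(), 'html.parser', from_encoding=page.headers.get_param('charset'))
# email_string must be the str pattern here, because BeautifulSoup yields decoded text
for textelem in soup.find_all(text=re.compile(email_string)):
    print(textelem)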