BeautifulSoup email extraction not working

Time: 2017-03-06 05:54:12

Tags: regex python-3.x beautifulsoup

I wrote a basic script to extract email addresses from a web page.

from bs4 import BeautifulSoup
import requests, re

def get_email(url):
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
        'Upgrade-Insecure-Requests': '1', 'x-runtime': '148ms'}, allow_redirects=True).content

    soup = BeautifulSoup(response, "html.parser")

    email = soup(text=re.compile(r'^[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]')) # this is working with

    print ("email ",email)


get_email('http://www.aberdeenweddingshop.co.uk/contact-us')
get_email('http://www.foodforthoughtdeli.co.uk/contact.htm')

OUTPUT:  
email  info@aberdeenweddingshop.co.uk
email  [] <------------------------#should give info@foodforthoughtdeli.co.uk

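It gives the correct result for the first URL but extracts nothing from the second one, and I don't know why. I also tried changing the regex. I verified the regex here, but for some reason it does not work in the code.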

2 answers:

Answer 0: (score: 1)

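In your first case, the email is the entire text of a single span. In the second case, the email sits inside a p element whose text contains more than just the email.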

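Your regex does not match in the second case because it is anchored to the beginning of the string, and the text there starts with characters that the pattern does not accept at that position.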
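The anchoring problem can be reproduced without any network access. The sketch below is only an illustration: `span_text` and `p_text` are simplified stand-ins for the two text nodes (their exact whitespace is an assumption), and `found` is a hypothetical helper. As far as I know, BeautifulSoup applies a compiled pattern to each text node with search semantics, so plain `re` is used here to mimic that.

```python
import re

# The pattern from the question, with and without the leading caret.
ANCHORED = re.compile(r'^[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]')
UNANCHORED = re.compile(r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]')

# Assumed stand-ins for the text nodes on the two pages.
span_text = "info@aberdeenweddingshop.co.uk"
p_text = "E-mail: \n\t\t\tinfo@foodforthoughtdeli.co.uk\n\t\t\t"


def found(pattern, text):
    """Return the matched address, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group() if match else None


for label, text in (("span", span_text), ("p", p_text)):
    print(label, "anchored:", found(ANCHORED, text))
    print(label, "unanchored:", found(UNANCHORED, text))
```

The anchored pattern only succeeds when the address is the very first thing in the text node, which is why the node starting with "E-mail:" is skipped.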

You have to find the email within the string and then extract it. For example:

from bs4 import BeautifulSoup
import requests, re

def get_email(url):
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
        'Upgrade-Insecure-Requests': '1', 'x-runtime': '148ms'}, allow_redirects=True).content

    soup = BeautifulSoup(response, "html.parser")

    email = soup(text=re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*'))

    _emailtokens = str(email).replace("\\t", "").replace("\\n", "").split(' ')

    if len(_emailtokens):
        print([match.group(0) for token in _emailtokens for match in [re.search(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", str(token.strip()))] if match])


get_email('http://www.aberdeenweddingshop.co.uk/contact-us')
get_email('http://www.foodforthoughtdeli.co.uk/contact.htm')

Output:

['info@aberdeenweddingshop.co.uk']
['info@foodforthoughtdeli.co.uk']
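The "find it in the string, then extract it" step can also be done without converting the result list to a string and splitting it. The following is a minimal sketch, not the answer's own method: `extract_emails` and `EMAIL_RE` are hypothetical names, the function takes the list of text nodes that a call like `soup(text=...)` returns, and the sample `nodes` are assumed stand-ins for the two pages.

```python
import re

# Unanchored pattern, so the address may appear anywhere in a text node.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")


def extract_emails(text_nodes):
    """Collect unique addresses from an iterable of text nodes, in order."""
    found = []
    for node in text_nodes:
        for candidate in EMAIL_RE.findall(str(node)):
            # Drop a sentence-ending period that the pattern may swallow.
            address = candidate.rstrip(".")
            if address not in found:
                found.append(address)
    return found


# Assumed stand-ins for the text nodes matched on the two pages.
nodes = [
    "info@aberdeenweddingshop.co.uk",
    "E-mail: \n\t\t\tinfo@foodforthoughtdeli.co.uk\n\t\t\t",
]
print(extract_emails(nodes))
```

Using `findall` on each node also picks up several addresses in one paragraph, which the token-splitting approach only handles when they are separated by spaces.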

Answer 1: (score: 1)

The missing match for the second URL is caused by the caret (^), which requires the regex to match at the beginning of the string. If you omit the caret, you get the following:

>>> soup(text=re.compile(r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]'))
['E-mail: \n\t\t\t\t\t\t\t\t\t\t\t\t\tinfo@foodforthoughtdeli.co.uk\n\t\t\t\t\t\t\t\t\t\t\t\t\t']

Since we are using a regex to match a string in the response, we are not really using the good parts of Beautiful Soup, and it can be left out altogether:

import requests, re

def get_email(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
        'Upgrade-Insecure-Requests': '1', 'x-runtime': '148ms'}
    # .text is the decoded string, so the str pattern can be applied to it directly
    response = requests.get(url, headers=headers, allow_redirects=True).text
    email_address = re.search(r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]', response).group()
    print(email_address)

Note: I use the response object's text attribute to work with the string representation, rather than the byte stream returned by the content attribute.
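The difference between the string and the byte stream matters because a str pattern cannot be applied to bytes. The sketch below illustrates this offline; `first_email` is a hypothetical helper, the sample `raw` page is made up, and decoding as UTF-8 is an assumption (Requests picks the encoding for the text attribute itself). It also returns None instead of raising when the page contains no address.

```python
import re

PATTERN = r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]'


def first_email(page):
    """Return the first address in a page given as str or bytes, else None."""
    if isinstance(page, bytes):
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        page = page.decode("utf-8", errors="replace")
    match = re.search(PATTERN, page)
    return match.group() if match else None


# Made-up page body, in the form the byte-stream attribute would return it.
raw = b"<p>E-mail: info@foodforthoughtdeli.co.uk</p>"

try:
    re.search(PATTERN, raw)  # str pattern against bytes
except TypeError as exc:
    print("bytes input rejected:", exc)

print(first_email(raw))
print(first_email("<p>no address here</p>"))
```

Checking the match before calling `group()` avoids an AttributeError on pages without an email address.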