I wrote a basic script to extract email addresses from a web page.
from bs4 import BeautifulSoup
import requests, re

def get_email(url):
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
        'Upgrade-Insecure-Requests': '1', 'x-runtime': '148ms'}, allow_redirects=True).content
    soup = BeautifulSoup(response, "html.parser")
    email = soup(text=re.compile(r'^[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]'))  # this is working with
    print("email ", email)

get_email('http://www.aberdeenweddingshop.co.uk/contact-us')
get_email('http://www.foodforthoughtdeli.co.uk/contact.htm')
OUTPUT:
email info@aberdeenweddingshop.co.uk
email [] <------------------------#should give info@foodforthoughtdeli.co.uk
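It gives the correct result for the first URL but extracts nothing from the second one, and I don't know why. I also tried changing the regular expression. I verified the regular expression here, but for some reason it does not work in the code.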
Answer 0 (score: 1)
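In your first case, the email is the text of a single span. In your second case, the email is inside a p element whose text contains more than just the email.
Your regex does not match on the second page because you are anchoring the search to the start of the string, and because the pattern allows characters that are not valid in this context.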
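To see the anchoring problem in isolation, here is a minimal sketch that uses only the standard `re` module. The Beautiful Soup documentation says a compiled pattern is applied to each string with `search()`, so the sketch calls `search()` directly. The names `ANCHORED`, `UNANCHORED`, `span_text` and `p_text` are made up for illustration, and the two strings are simplified stand-ins for the text nodes, not copies of the live pages.

```python
import re

# The pattern from the question, with and without the leading caret.
ANCHORED = re.compile(r'^[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]')
UNANCHORED = re.compile(r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]')

# Assumed shapes of the two text nodes (illustrative only).
span_text = 'info@aberdeenweddingshop.co.uk'                 # the email is the whole string
p_text = 'E-mail: \n\t\tinfo@foodforthoughtdeli.co.uk\n\t\t'  # the email is buried in other text

# Without re.MULTILINE, ^ only matches at position 0 of the string.
anchored_span = ANCHORED.search(span_text)   # a match object
anchored_p = ANCHORED.search(p_text)         # None: the string starts with "E-mail:"
unanchored_p = UNANCHORED.search(p_text)     # a match object

print(anchored_span.group())
print(anchored_p)
print(unanchored_p.group())
```

Note that dropping the caret only makes the text node match. Beautiful Soup still hands back the whole string, so the address has to be pulled out of it in a second step.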
You have to find the email within the string and then extract it. For example:
from bs4 import BeautifulSoup
import requests, re

def get_email(url):
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
        'Upgrade-Insecure-Requests': '1', 'x-runtime': '148ms'}, allow_redirects=True).content
    soup = BeautifulSoup(response, "html.parser")
    email = soup(text=re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*'))
    _emailtokens = str(email).replace("\\t", "").replace("\\n", "").split(' ')
    if len(_emailtokens):
        print([match.group(0) for token in _emailtokens for match in [re.search(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", str(token.strip()))] if match])

get_email('http://www.aberdeenweddingshop.co.uk/contact-us')
get_email('http://www.foodforthoughtdeli.co.uk/contact.htm')
Output:
['info@aberdeenweddingshop.co.uk']
['info@foodforthoughtdeli.co.uk']
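A possible simplification of the same idea, offered as a sketch rather than a drop-in replacement: run `re.findall` over the text instead of converting the result list to a string and splitting it on spaces. The helper name `extract_emails`, the pattern `EMAIL_RE` and the sample string are assumptions for illustration. The pattern is deliberately simple and does not try to cover every valid address.

```python
import re

# Simple address pattern (an assumption for illustration, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r'[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')

def extract_emails(text):
    """Return the unique addresses found anywhere in text, in first-seen order."""
    seen = []
    for address in EMAIL_RE.findall(text):
        if address not in seen:
            seen.append(address)
    return seen

# Made-up sample: surrounding words, tabs, newlines and a repeated address.
sample = ('E-mail: \n\t\tinfo@foodforthoughtdeli.co.uk\n\t\t'
          'or sales@example.com. Again: info@foodforthoughtdeli.co.uk')
found = extract_emails(sample)
print(found)
```

With Beautiful Soup you could feed it the page text, for example `extract_emails(soup.get_text())`, which avoids depending on how the list of text nodes is rendered by `str()`.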
Answer 1 (score: 1)
The missing match for the second URL is caused by the caret (^), which requires the regex to match at the start of the string. If you omit the caret, you get the following:
>>> soup(text=re.compile(r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]'))
['E-mail: \n\t\t\t\t\t\t\t\t\t\t\t\t\tinfo@foodforthoughtdeli.co.uk\n\t\t\t\t\t\t\t\t\t\t\t\t\t']
Since we are using a regex to match strings in the response, we are not really using the good parts of Beautiful Soup, and it can be left out entirely:
def get_email(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
        'Upgrade-Insecure-Requests': '1', 'x-runtime': '148ms'}
    response = requests.get(url, headers=headers, allow_redirects=True).text
    email_address = re.search(r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]', response).group()
    print(email_address)
Note: I use the text attribute of the response object to work with the string representation, rather than the byte stream returned by the content attribute.
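The text versus content distinction matters because `re` refuses to apply a str pattern to bytes, and `re.search` returns None when nothing matches, so calling `.group()` on the result would then raise. A minimal sketch that covers both cases follows. The helper `first_email`, the `encoding` default and the sample `raw` bytes are assumptions for illustration, not part of the original answer.

```python
import re

EMAIL_PATTERN = r'[a-zA-Z]+[\w\-.]+@[\w-]+\.[\w.-]+[a-zA-Z]'

def first_email(body, encoding='utf-8'):
    """Return the first address in body, or None when nothing matches."""
    if isinstance(body, bytes):
        # A str pattern cannot be applied to bytes, so decode first.
        body = body.decode(encoding, errors='replace')
    match = re.search(EMAIL_PATTERN, body)
    # Guard against None instead of calling .group() unconditionally.
    return match.group() if match else None

# Made-up stand-in for what a response's content attribute might hold.
raw = b'<p>E-mail: info@foodforthoughtdeli.co.uk</p>'

try:
    re.search(EMAIL_PATTERN, raw)   # str pattern on bytes
    bytes_search_failed = False
except TypeError:
    bytes_search_failed = True

print(bytes_search_failed)
print(first_email(raw))
print(first_email('no address here'))
```

In the function above you would pass it the response's text attribute directly. The bytes branch is only there for the case where you already have content in hand.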