I found the following code, which scrapes email addresses from a website (from every page of the site, I believe).
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
    [A-Za-z]+                any character of: 'A' to 'Z', 'a' to 'z'
                             (1 or more times (matching the most
                             amount possible))
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    ['-]?                    any character of: ''', '-' (optional
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    [a-z]+                   any character of: 'a' to 'z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    [a-z]*                   any character of: 'a' to 'z' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
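How can I modify this code so that it extracts emails from a single web page only? I only need to target one page, not the entire site.
Answer 0 (score: 1)
Simply delete every line from for anchor in soup.find_all("a"): onward. Your script should then look like this: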
import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup
# starting url. replace it with your own url.
starting_url = 'http://www.miet.ac.in'
# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])
# set of already crawled urls for email
processed_urls = set()
# a set of fetched emails
emails = set()
# process urls one by one from unprocessed_url queue until queue is empty
while len(unprocessed_urls):
    # move next url from the queue to the set of processed urls
    url = unprocessed_urls.popleft()
    processed_urls.add(url)
    # extract base url to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url
    # get url's content
    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue
    # extract all email addresses and add them into the resulting set
    # You may edit the regular expression as per your requirement
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(emails)
    # create a Beautiful Soup object for the html document
    soup = BeautifulSoup(response.text, 'lxml')
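Because no new links are ever queued, the loop above runs exactly once, so the task comes down to applying the regular expression to one page's text. The following is a minimal sketch of that extraction step on its own. The names `EMAIL_PATTERN`, `extract_emails`, and `sample_html` are illustrative assumptions rather than part of the original script; the pattern itself is the one used above.

```python
import re

# Same pattern as in the script above; re.I makes it case-insensitive.
EMAIL_PATTERN = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)


def extract_emails(page_text):
    """Return the set of email-like strings found in one page's text."""
    return set(EMAIL_PATTERN.findall(page_text))


# Hypothetical sample standing in for response.text of a single page.
sample_html = (
    '<p>Contact info@example.com or '
    '<a href="mailto:sales@example.co.uk">sales</a></p>'
)
found = extract_emails(sample_html)
print(sorted(found))
```

For a real page, the text returned by `requests.get(url).text` can be passed to `extract_emails` in place of the sample string. Since no links are followed, only that one page is scanned.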
To generate random email addresses in Python, use the following approach:
from faker import Faker
faker = Faker()
for i in range(12):
    print(f'{faker.email()}')