Extracting emails from a web page in Python

Posted: 2020-10-30 22:12:43

Tags: python beautifulsoup python-requests

I found the following code, which scrapes email addresses from a website (the entire site, I believe):

Putting the pieces together, the breakdown below describes this pattern:

^(?!.*'[A-Za-z]+')\s*[A-Z]+(?:['-]?[a-z]+)*(?:\s*[a-z]*)*$

  ^                        the beginning of the string
  (?!                      look ahead to see if there is not:
    .*                       any character except \n (0 or more times,
                             matching the most amount possible)
    '                        a literal '\''
    [A-Za-z]+                any character of: 'A' to 'Z', 'a' to 'z'
                             (1 or more times)
    '                        a literal '\''
  )                        end of look-ahead
  \s*                      whitespace (\n, \r, \t, \f, and " ")
                           (0 or more times)
  [A-Z]+                   any character of: 'A' to 'Z' (1 or more times)
  (?:                      group, but do not capture (0 or more times):
    ['-]?                    any character of: ''', '-' (optional)
    [a-z]+                   any character of: 'a' to 'z' (1 or more times)
  )*                       end of grouping
  (?:                      group, but do not capture (0 or more times):
    \s*                      whitespace (0 or more times)
    [a-z]*                   any character of: 'a' to 'z' (0 or more times)
  )*                       end of grouping
  $                        before an optional \n, and the end of the string
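
For what it's worth, here is a minimal sketch of running the assembled pattern with Python's re module; the sample strings are made up purely for illustration:

import re

# pattern assembled from the breakdown above
pattern = re.compile(r"^(?!.*'[A-Za-z]+')\s*[A-Z]+(?:['-]?[a-z]+)*(?:\s*[a-z]*)*$")

# a capitalized phrase matches; a quoted word trips the negative look-ahead
for candidate in ["Hello world", "HELLO", "'quoted'"]:
    print(candidate, bool(pattern.match(candidate)))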

How can I modify code like this so that it extracts from only one web page? I only need to target a single page, not the whole site.

1 answer:

Answer 0 (score: 1)

Just remove all of the lines starting from for anchor in soup.find_all("a"):. Your script should then look like this:

import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup

# starting URL; replace it with the page you want to scrape
starting_url = 'http://www.miet.ac.in'

# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])

# set of already crawled urls for email
processed_urls = set()

# a set of fetched emails
emails = set()

# process urls one by one from unprocessed_url queue until queue is empty
while len(unprocessed_urls):

    # move next url from the queue to the set of processed urls
    url = unprocessed_urls.popleft()
    processed_urls.add(url)

    # extract the base url to resolve relative links (only needed if you
    # later re-add link discovery; harmless to keep for a single page)
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get url's content
    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue

    # extract all email addresses and add them into the resulting set
    # You may edit the regular expression as per your requirement
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(emails)
    # create a BeautifulSoup object for the html document (only needed if
    # you re-add the link-discovery loop)
    soup = BeautifulSoup(response.text, 'lxml')
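
For context, the link-discovery block the answer tells you to delete is what made the original script crawl the whole site. A typical version of it looks like the sketch below (reconstructed from the common form of this crawler, so treat it as an approximation rather than the asker's exact code):

    # (removed) link discovery: queue every anchor found on the page, which
    # is what made the original script crawl the whole site
    for anchor in soup.find_all("a"):
        # extract the link url from the anchor, if present
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links against base_url / path
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # enqueue urls that have not been seen before
        if not link in unprocessed_urls and not link in processed_urls:
            unprocessed_urls.append(link)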

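If you only ever need one page, you can also drop the queue entirely. A minimal single-page sketch using the same regex (the URL is just a placeholder):

import re
import requests

# fetch a single page and pull out anything that looks like an email address
response = requests.get('http://www.miet.ac.in')  # placeholder url
emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+",
                        response.text, re.I))
print(emails)
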
To generate random email addresses with Python, you can use the following approach:

from faker import Faker

faker = Faker()

# print 12 randomly generated email addresses
for _ in range(12):
    print(faker.email())
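
If you want the generated addresses to be reproducible between runs, faker supports seeding; a small sketch (the seed value is arbitrary):

from faker import Faker

Faker.seed(4321)  # arbitrary seed; the same seed yields the same sequence
faker = Faker()

for _ in range(12):
    print(faker.email())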