Question

我正在尝试从this site's内部按字母顺序排列的索引中检索所有电子邮件地址。

基本上，我正在寻找一种方法来使用BSoup首先浏览所有不同的字母链接，然后浏览每个公司页面以打印所有相应的电子邮件地址。

我已经能够打印网站上所有公司的列表，但我不确定如何迭代其他级别的链接。我考虑过使用字典并分别为每个字母创建密钥，但我似乎无法让它工作。

这是迄今为止成功提取所有公司名称的代码，以及逐个单独提取电子邮件地址的正则表达式。如何一次打印所有电子邮件地址？

赞赏任何意见。

from bs4 import BeautifulSoup
import requests

alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
#alphabet = ['a']

resultsdict = {}
companyname = []
url1 = 'http://www.indiainfoline.com/Markets/Company/'
url2 = '.aspx'
for element in alphabet:
    html = requests.get(url1 + element + url2).text
    bs = BeautifulSoup(html)
    # find the links to companies
    company_menu = bs.find("div",{'style':'padding-left:5px'})
    # print all companies links
    companies = company_menu.find_all('a')
    for company in companies:
        print company.getText().strip()







import re
# example company page
html = requests.get('http://www.indiainfoline.com/Markets/Company/Adani-Power-    Ltd/533096').text
EMAIL_REGEX = re.compile("mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")
re.findall(EMAIL_REGEX, html)

Answer 1

执行大量网络抓取工作的人的建议：使用公司链接制作循环，打开页面并获取它找到的电子邮件（或您希望的任何数据）。我只在页面上看到一个电子邮件链接，所以它找到的那个。一个粗略的例子：

for company in companies:
    company_html = requests.get(company['href'])
    company_bs = BeautifulSoup(company_html)
    company_page_links = company_bs('a')
    for link in company_page_links:
        if link['href'].startswith('mailto:'):
            #You found the e-mail address!
            break#Exits the loop, as you already found the address

如何使用BeautifulSoup迭代站点上的多个内部链接以输出所有电子邮件地址？

1 个答案: