How can I best print two different unlabeled HTML fragments to CSV with Beautiful Soup?

Time: 2013-12-12 02:34:43

Tags: python html html-parsing beautifulsoup scrape

To preface, I am a beginner and this is my first time using BeautifulSoup. Any input is greatly appreciated.

I am trying to scrape all of the company names and email addresses from this site. There are three layers of links to crawl through (alphabetically paginated list -> list of companies by letter -> company detail page), after which I would print them to a CSV.

So far I have been able to isolate the alphabetical list of links with the code below, but I get stuck when trying to isolate the individual company pages and then extract the name/email from the untagged HTML.

import re
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.indiainfoline.com/Markets/Company/A.aspx').read()
soup = BeautifulSoup(page)
soup.prettify()

pattern = re.compile(r'^\/Markets\/Company\/\D\.aspx$')

all_links = []
navigation_links = []
root = "http://www.indiainfoline.com/"

# Finding all links
for anchor in soup.findAll('a', href=True):
    all_links.append(anchor['href'])
# Isolate links matching regex (match once, reuse the result)
for link in all_links:
    match = re.match(pattern, link)
    if match:
        navigation_links.append(root + match.group(0))
navigation_links = list(set(navigation_links))

company_pages = []
for url in navigation_links:
    # Fetch and parse each alphabetical page; the original loop reused the
    # soup from the first page, so only one page's companies were collected.
    page = urllib2.urlopen(url).read()
    page_soup = BeautifulSoup(page)
    quote_table = page_soup.findAll('table', id='AlphaQuotes1_Rep_quote')[0]
    for anchor in quote_table.findAll('a', href=True):
        company_pages.append(root + anchor['href'])

1 answer:

Answer 0 (score: 0):

Piece by piece. Getting the links for each company is simple:

from bs4 import BeautifulSoup
import requests

html = requests.get('http://www.indiainfoline.com/Markets/Company/A.aspx').text
bs = BeautifulSoup(html)

# find the links to companies
company_menu = bs.find("div",{'style':'padding-left:5px'})
# print all companies links
companies = company_menu.find_all('a')
for company in companies:
    print company['href']
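One caveat, not spelled out in the answer: the hrefs printed above can be site-relative, which is why the question's code prepends a root URL by string concatenation. The standard library's `urljoin` handles both relative and absolute hrefs correctly; a small sketch with hypothetical hrefs:

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = "http://www.indiainfoline.com/"
# Hypothetical hrefs as they might come out of company['href']
hrefs = ["/Markets/Company/Adani-Power-Ltd/533096", "Markets/Company/A.aspx"]
# urljoin resolves each href against the base URL
absolute = [urljoin(base, h) for h in hrefs]
```

Unlike `root + href`, this will not produce doubled or missing slashes, and it leaves already-absolute URLs untouched.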

Second, getting the company names:

for company in companies:
    print company.getText().strip()

Third, the emails are slightly more complicated, but you can use a regular expression here. So on an individual company page, do the following:

import re
# example company page
html = requests.get('http://www.indiainfoline.com/Markets/Company/Adani-Power-Ltd/533096').text
EMAIL_REGEX = re.compile("mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")
re.findall(EMAIL_REGEX, html)
# and there you got a list of found emails
...
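The original question asked about printing the results to CSV, which the answer never reaches. Below is a minimal sketch of that last step, reusing the answer's `mailto:` regex on already-fetched page HTML; the company name and page snippet here are hypothetical stand-ins for what the scraper would actually collect:

```python
import csv
import re

# Same mailto: regex idea as in the answer above
EMAIL_REGEX = re.compile(r"mailto:([A-Za-z0-9.\-+]+@[A-Za-z0-9_\-]+[.][a-zA-Z]{2,4})")

def extract_emails(html):
    """Return all email addresses found in mailto: links."""
    return re.findall(EMAIL_REGEX, html)

# Hypothetical scraped data: (company name, company page HTML) pairs
scraped = [
    ("Adani Power Ltd", '<a href="mailto:contact@example.com">Contact</a>'),
]

with open("companies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["company", "email"])
    for name, page_html in scraped:
        for email in extract_emails(page_html):
            writer.writerow([name, email])
```

In the real scraper, `scraped` would be built by fetching each URL in `company_pages` and pairing it with the name pulled out in the second step.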

Cheers,