Extracting an email address with Python web scraping is not working

Date: 2014-03-29 10:31:52

Tags: python web-scraping beautifulsoup

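Can someone help me write the part of the code that extracts the email address from the following HTML with BeautifulSoup? I have tried:

  1. the select method
  2. the find method
  3. the find_all method

HTML: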

    <div id="google_ads_div_990x50-Top_Bar-Classified_Detail_ad_wrapper">
    <div id="google_ads_div_990x50-Top_Bar-Classified_Detail_ad_container" style="display:inline-block;">
    <div id="top-bar-branding">
    <div id="top-bar-branding-logo" style="margin-right:20px margin-left:6px">
    <div id="top-bar-branding-text" style="color:#000; font-size:14px; font-weight:bold; width:450px; text-align:center">As we promised</div>
    <div id="top-bar-branding-extra" style="color:#000; font-size:14px; font-weight:bold;">
    <span style="color:#444; font-weight:normal;">Telephone </span>
    04 451 3111
    <span style="color:#444; font-weight:normal;">or email </span>
    <span style="color:#cf3023;"> info@home4all.ae</span>
    </div>
    </div>
    </div>
    </div>
    </div>
    </div>
    


I am trying this, but it returns an empty list, []:

    email=soup.select("div #top-bar-branding-extra color:#cf3023;")
    print email 
    

This does not work either:

    div = soup.find("div", {"id":"top-bar-branding-extra"})
    span = div.find("span", {"style":"color:#cf3023;"})
    print span.string
    

1 answer:

Answer 0 (score: 1)

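The .select() method only accepts CSS selectors (tag names, IDs, classes, and other CSS selector syntax), not whole CSS declarations (the contents of a style attribute); you would have to search for: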

    soup.select('div#top-bar-branding-extra span')

because a bare declaration such as color:#cf3023; is not selector syntax. You can then filter the matched elements further:

    for span in soup.select('div#top-bar-branding-extra span'):
        if span.get('style') == 'color:#cf3023;':
            email = span.text
            break

Or make it a generator expression that defaults to None:

    email = next((s.text for s in soup.select('div#top-bar-branding-extra span')
                  if s.get('style') == 'color:#cf3023;'), None)
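
As a possible alternative to filtering in Python: BeautifulSoup releases that delegate .select() to the soupsieve package (4.7 and later, as far as I know) accept CSS attribute selectors, and those can match against the raw text of the style attribute. A minimal Python 3 sketch, assuming the markup looks like the HTML in the question; the substring match on "cf3023" is my own choice and not part of the original answer:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
html = '''
<div id="top-bar-branding-extra" style="color:#000; font-size:14px; font-weight:bold;">
<span style="color:#444; font-weight:normal;">Telephone </span>
04 451 3111
<span style="color:#444; font-weight:normal;">or email </span>
<span style="color:#cf3023;"> info@home4all.ae</span>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# [style*="cf3023"] matches any span whose style attribute text contains "cf3023".
span = soup.select_one('div#top-bar-branding-extra span[style*="cf3023"]')

# select_one() returns None when nothing matches, so guard before reading text.
email = span.get_text(strip=True) if span is not None else None
print(email)
```

This still compares attribute text, not computed styles, so it only helps when the colour really is written inline on the span.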

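But you will need to look at the actual page source (not the browser's DOM representation) to check that the attribute text there matches closely enough.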
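
If the attribute text in the real source differs only in whitespace, letter case, or a trailing semicolon, an exact string comparison will miss it. One way around that is to compare parsed declarations instead. The helpers below (parse_style and has_declaration, both hypothetical names) are a rough sketch that assumes simple name:value declarations without semicolons inside values:

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict.

    Names and values are stripped and lower-cased. This is a naive split on
    ';' and ':', so it is only suitable for simple declarations.
    """
    declarations = {}
    for part in (style or "").split(";"):
        name, sep, value = part.partition(":")
        if sep:  # skip empty fragments such as the one after a trailing ';'
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def has_declaration(style, name, value):
    """Return True if the style string sets the given property to the given value."""
    return parse_style(style).get(name.lower()) == value.lower()


print(has_declaration(" Color : #CF3023 ", "color", "#cf3023"))
```

In the filtering loop, the test would then read has_declaration(span.get('style'), 'color', '#cf3023') in place of the == comparison.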

If the HTML source you posted is accurate, then both the filtering loop and the generator expression work:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup('''\
    ... <div id="google_ads_div_990x50-Top_Bar-Classified_Detail_ad_wrapper">
    ... <div id="google_ads_div_990x50-Top_Bar-Classified_Detail_ad_container" style="display:inline-block;">
    ... <div id="top-bar-branding">
    ... <div id="top-bar-branding-logo" style="margin-right:20px margin-left:6px">
    ... <div id="top-bar-branding-text" style="color:#000; font-size:14px; font-weight:bold; width:450px; text-align:center">As we promised</div>
    ... <div id="top-bar-branding-extra" style="color:#000; font-size:14px; font-weight:bold;">
    ... <span style="color:#444; font-weight:normal;">Telephone </span>
    ... 04 451 3111
    ... <span style="color:#444; font-weight:normal;">or email </span>
    ... <span style="color:#cf3023;"> info@home4all.ae</span>
    ... </div>
    ... </div>
    ... </div>
    ... </div>
    ... </div>
    ... </div>
    ... ''')
    >>> for span in soup.select('div#top-bar-branding-extra span'):
    ...     if span.get('style') == 'color:#cf3023;':
    ...         email = span.text
    ...         break
    ... 
    >>> email
    u' info@home4all.ae'
    >>> email = next((s.text for s in soup.select('div#top-bar-branding-extra span')
    ...               if s.get('style') == 'color:#cf3023;'), None)
    >>> email
    u' info@home4all.ae'

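Note that this requires the actual source loaded from your URL to contain this structure. Judging by the HTML, you are trying to read an email address from a Google ad on the page, and such ads are always loaded via JavaScript and are not part of the original source.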
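
A quick way to check this before writing any selectors is to test whether the raw, un-rendered HTML contains the element at all. The sketch below uses only the standard library; source_has_id is a hypothetical helper name, and it assumes you already have the page source as a string:

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collect every id attribute that appears on a start tag."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def source_has_id(html, element_id):
    """Return True if the given HTML text contains an element with this id."""
    collector = _IdCollector()
    collector.feed(html)
    collector.close()
    return element_id in collector.ids


sample = '<div id="top-bar-branding-extra"><span>x</span></div>'
print(source_has_id(sample, "top-bar-branding-extra"))
```

To run it against the live page, the source could be fetched with something like urllib.request.urlopen(url).read().decode("utf-8", "replace"). If the function returns False for that text, the element is being added by JavaScript and BeautifulSoup alone will never see it.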

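You will have to either analyze how Google loads the ad and replicate that in Python, or use a full web client (such as ghost or a Selenium-driven browser) to execute the JavaScript, retrieve the resulting DOM, and then parse that.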
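
For the Selenium route, a rough sketch of the "execute the JavaScript, then retrieve the DOM" step could look like the following. rendered_source and its make_driver parameter are hypothetical names; the sketch assumes the selenium package and a Firefox driver are installed, and it does not wait for late-loading content:

```python
def rendered_source(url, make_driver=None):
    """Load url in a real browser and return the HTML of the resulting DOM.

    make_driver is any zero-argument callable returning an object with
    get(), page_source and quit(); it defaults to selenium's Firefox driver.
    """
    if make_driver is None:
        # Third-party dependency, imported lazily so that the function can
        # also be exercised with a stand-in driver.
        from selenium import webdriver
        make_driver = webdriver.Firefox

    driver = make_driver()
    try:
        driver.get(url)            # the browser runs the page's JavaScript
        return driver.page_source  # serialized DOM at this moment
    finally:
        driver.quit()              # always close the browser
```

The returned string can be passed to BeautifulSoup(...) and queried with the same selectors as before. Two caveats: an ad that finishes loading after get() returns may need an explicit wait, and an ad rendered inside an iframe may not appear in page_source for the top-level document, in which case the driver would have to switch into that frame first.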