Python web scraper and parent name problem

Time: 2014-02-15 22:45:26

Tags: python web-scraping beautifulsoup python-3.3

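I am trying to retrieve the dates inside div class="ipo-cell-height" together with the company names, for example 2/21/2014 and Sundance Energy Australia. The site is http://www.nasdaq.com/markets/ipos/ and the relevant HTML is below. This snippet comes from the second div class="genTable thin floatL" style="width:315px" on the page.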

<div class="genTable thin floatL" style="width:315px">
                <h3 class="table-headtag">Upcoming IPOs</h3>
                <table><tbody>
                    <tr>
                        <td><div class="ipo-cell-height">2/21/2014</div></td>
                        <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_0" href="http://www.nasdaq.com/markets/ipos/company/sundance-energy-australia-ltd-672724-74237">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td>
                    </tr>

                    <tr>
                        <td><div class="ipo-cell-height">2/14/2014</div></td>
                        <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_1" href="http://www.nasdaq.com/markets/ipos/company/inogen-inc-639597-74090">INOGEN INC</a></div></td>
                    </tr>

                    <tr>
                        <td><div class="ipo-cell-height">2/14/2014</div></td>
                        <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_2" href="http://www.nasdaq.com/markets/ipos/company/semler-scientific-inc-920476-73980">SEMLER SCIENTIFIC, INC.</a></div></td>
                    </tr>

                    <tr>
                        <td><div class="ipo-cell-height">10/9/2013</div></td>
                        <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_3" href="http://www.nasdaq.com/markets/ipos/company/sfx-entertainment-inc-885264-73081">SFX ENTERTAINMENT, INC</a></div></td>
                    </tr>
                </tbody></table>

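The code I am using relies on BeautifulSoup, and I think it needs parent.name or .contents. This code only prints the first 10 items. I thought I could use the div class as the parent.name, but the "tbody" line does not work.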

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.nasdaq.com/markets/ipos/")
soup = BeautifulSoup(html)
for data in soup.find_all('td')[0:10]:
    if data.parent.name == "tr":
    # if data.parent.name == "tbody":  # This line makes it not print anything
        print(data.text)

2 Answers:

Answer 0 (score: 1)

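One approach is to iterate over all <div> elements whose class attribute has the value ipo-cell-height, check with a regular expression whether the text looks like a date, then find the next <div> element and print the text of both: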

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
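soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
for div in soup.find_all('div', attrs={'class': 'ipo-cell-height'}):
    s = div.string
    # .string can be None (e.g. a div with several children), so guard first
    if s and re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))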
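
On why the "tbody" check in the question prints nothing: in the markup shown, the direct parent of every td is its tr, and tbody is one level further up, so data.parent.name is never "tbody". The sketch below illustrates this on a trimmed copy of the question's HTML and then pairs date and company row by row instead of by regular expression. SAMPLE_HTML and upcoming_ipos are names made up for this illustration, and it assumes every row of interest has exactly two cells, which may not hold for other tables on the live page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Upcoming IPOs" markup from the question (two rows only).
SAMPLE_HTML = """
<div class="genTable thin floatL" style="width:315px">
  <h3 class="table-headtag">Upcoming IPOs</h3>
  <table><tbody>
    <tr>
      <td><div class="ipo-cell-height">2/21/2014</div></td>
      <td><div class="ipo-cell-height"><a href="#">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td>
    </tr>
    <tr>
      <td><div class="ipo-cell-height">2/14/2014</div></td>
      <td><div class="ipo-cell-height"><a href="#">INOGEN INC</a></div></td>
    </tr>
  </tbody></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The immediate parent of a <td> is its <tr>; <tbody> is the grandparent.
first_td = soup.find("td")
parent_chain = [p.name for p in first_td.parents]
print(first_td.parent.name)   # direct parent
print(parent_chain[:2])       # parent, then grandparent


def upcoming_ipos(soup):
    """Return (date, company) tuples, one per table row that has two cells."""
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:  # assumption: date cell followed by company cell
            rows.append((cells[0], cells[1]))
    return rows


ipos = upcoming_ipos(soup)
for date, company in ipos:
    print("{} - {}".format(date, company))
```

Against the live page, the same function could be applied to a soup built from the downloaded HTML, ideally after narrowing the search to the "Upcoming IPOs" container so that unrelated two-column tables are not picked up.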

Answer 1 (score: 0)

You can build a list of the divs based on their CSS class, though note that this uses requests and BeautifulSoup 3:

import requests
from BeautifulSoup import BeautifulSoup

req = requests.get('http://nasdaq.com/markets/ipos')
soup = BeautifulSoup(req.content)

ipo_divs = soup.findAll('div', {'class':'genTable thin floatL'})[0]
c = ipo_divs.findAll('div', {'class':'ipo-cell-height'})

ipos = {c[i].text:c[i + 1].text for i in xrange(0, len(c) - 1, 2)}
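
Two caveats about the snippet above: BeautifulSoup 3 and xrange only work on Python 2, and a dict keyed by date keeps just one company when two IPOs share a date (as the two 2/14/2014 rows in the question do). A possible Python 3 variant with bs4 is sketched below on an inline sample rather than the live page. SAMPLE_HTML is a made-up name, and the sketch assumes the cells inside the container strictly alternate date, company.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the question's markup; two rows share a date.
SAMPLE_HTML = """
<div class="genTable thin floatL" style="width:315px">
  <table><tbody>
    <tr>
      <td><div class="ipo-cell-height">2/14/2014</div></td>
      <td><div class="ipo-cell-height"><a href="#">INOGEN INC</a></div></td>
    </tr>
    <tr>
      <td><div class="ipo-cell-height">2/14/2014</div></td>
      <td><div class="ipo-cell-height"><a href="#">SEMLER SCIENTIFIC, INC.</a></div></td>
    </tr>
    <tr>
      <td><div class="ipo-cell-height">10/9/2013</div></td>
      <td><div class="ipo-cell-height"><a href="#">SFX ENTERTAINMENT, INC</a></div></td>
    </tr>
  </tbody></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_="genTable" matches any div that has genTable among its classes.
box = soup.find("div", class_="genTable")
cells = [d.get_text(strip=True)
         for d in box.find_all("div", class_="ipo-cell-height")]

# Assumption: cells alternate date, company, date, company, ...
pairs = list(zip(cells[0::2], cells[1::2]))

# For comparison: keying by date silently collapses the shared date.
by_date = dict(pairs)

for date, company in pairs:
    print("{} - {}".format(date, company))
```

Keeping a list of tuples preserves every row and the page order; a dict is still convenient when dates are known to be unique or when the company name is used as the key instead.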