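I am trying to retrieve the dates inside div class="ipo-cell-height" together with the company names, for example 2/21/2014 and Sundance Energy Australia. This is the link to the site: http://www.nasdaq.com/markets/ipos/. Here is the HTML; this snippet is the second div class="genTable thin floatL" style="width:315px" on the page.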
<div class="genTable thin floatL" style="width:315px">
<h3 class="table-headtag">Upcoming IPOs</h3>
<table><tbody>
<tr>
<td><div class="ipo-cell-height">2/21/2014</div></td>
<td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_0" href="http://www.nasdaq.com/markets/ipos/company/sundance-energy-australia-ltd-672724-74237">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td>
</tr>
<tr>
<td><div class="ipo-cell-height">2/14/2014</div></td>
<td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_1" href="http://www.nasdaq.com/markets/ipos/company/inogen-inc-639597-74090">INOGEN INC</a></div></td>
</tr>
<tr>
<td><div class="ipo-cell-height">2/14/2014</div></td>
<td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_2" href="http://www.nasdaq.com/markets/ipos/company/semler-scientific-inc-920476-73980">SEMLER SCIENTIFIC, INC.</a></div></td>
</tr>
<tr>
<td><div class="ipo-cell-height">10/9/2013</div></td>
<td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_3" href="http://www.nasdaq.com/markets/ipos/company/sfx-entertainment-inc-885264-73081">SFX ENTERTAINMENT, INC</a></div></td>
</tr>
</tbody></table>
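The code I am using relies on BeautifulSoup, and I think it needs parent.name or .contents. This code only prints the first 10 items. I thought I could use the div class as parent.name, but the "tbody" line doesn't work.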
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.nasdaq.com/markets/ipos/")
soup = BeautifulSoup(html)
for data in soup.find_all('td')[0:10]:
    if data.parent.name == "tr":
    # if data.parent.name == "tbody":  # This line makes it not print anything
        print(data.text)
Answer 0 (score: 1)
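One approach is to traverse all <div> elements whose class attribute has the value ipo-cell-height, check with a regular expression whether the text of each one looks like a date, and then find the next <div> element and print the text of both elements: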
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all('div', attrs={'class': 'ipo-cell-height'}):
    s = div.string
    # div.string can be None (e.g. when the div has several children), so guard before matching.
    if s and re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        # The company name sits in the next <div> in document order.
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
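Because the live page may no longer match the markup shown in the question, the same idea can be tried offline. The following is only a minimal sketch, assuming the bs4 package is installed; SAMPLE_HTML is a trimmed copy of two rows from the question (the id and href attributes are shortened), and extract_ipos is a hypothetical helper name, not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, so no network access is needed.
SAMPLE_HTML = """
<div class="genTable thin floatL" style="width:315px">
<h3 class="table-headtag">Upcoming IPOs</h3>
<table><tbody>
<tr>
<td><div class="ipo-cell-height">2/21/2014</div></td>
<td><div class="ipo-cell-height"><a href="#">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td>
</tr>
<tr>
<td><div class="ipo-cell-height">2/14/2014</div></td>
<td><div class="ipo-cell-height"><a href="#">INOGEN INC</a></div></td>
</tr>
</tbody></table>
</div>
"""

# Same date pattern as in the answer: M/D/YYYY with one- or two-digit month and day.
DATE_RE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')


def extract_ipos(html):
    """Return (date, company) tuples; hypothetical helper for illustration."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all('div', class_='ipo-cell-height'):
        text = div.get_text(strip=True)
        if DATE_RE.match(text):
            # The company cell is the next <div> in document order.
            company_div = div.find_next('div')
            if company_div is not None:
                pairs.append((text, company_div.get_text(strip=True)))
    return pairs


for date, company in extract_ipos(SAMPLE_HTML):
    print('{} - {}'.format(date, company))
```

Using get_text(strip=True) instead of .string avoids a None value when a cell holds more than one child node.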
Answer 1 (score: 0)
You can build a list of the divs based on their CSS class; note that this uses requests and BeautifulSoup3:
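# Python 2 code: relies on BeautifulSoup 3 and xrange.
import requests
from BeautifulSoup import BeautifulSoup

req = requests.get('http://nasdaq.com/markets/ipos')
soup = BeautifulSoup(req.content)
# Take the first container div, then every cell inside it.
ipo_divs = soup.findAll('div', {'class':'genTable thin floatL'})[0]
c = ipo_divs.findAll('div', {'class':'ipo-cell-height'})
# Cells alternate date, company; rows sharing a date overwrite each other in the dict.
ipos = {c[i].text:c[i + 1].text for i in xrange(0, len(c) - 1, 2)}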
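A dict keyed by date keeps only one company per date, and the markup in the question has two rows dated 2/14/2014. As a rough Python 3 sketch of the same pairing idea, assuming bs4 is installed, the cells can be collected into a list of tuples instead. SAMPLE_HTML is a trimmed copy of the question's markup and pair_cells is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, including two rows with the same date.
SAMPLE_HTML = """
<div class="genTable thin floatL" style="width:315px">
<table><tbody>
<tr>
<td><div class="ipo-cell-height">2/21/2014</div></td>
<td><div class="ipo-cell-height"><a href="#">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td>
</tr>
<tr>
<td><div class="ipo-cell-height">2/14/2014</div></td>
<td><div class="ipo-cell-height"><a href="#">INOGEN INC</a></div></td>
</tr>
<tr>
<td><div class="ipo-cell-height">2/14/2014</div></td>
<td><div class="ipo-cell-height"><a href="#">SEMLER SCIENTIFIC, INC.</a></div></td>
</tr>
</tbody></table>
</div>
"""


def pair_cells(html):
    """Return (date, company) tuples in page order; hypothetical helper."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any single class, so 'genTable' finds the container div.
    container = soup.find('div', class_='genTable')
    if container is None:
        return []
    cells = [d.get_text(strip=True)
             for d in container.find_all('div', class_='ipo-cell-height')]
    # Even positions hold dates, odd positions hold company names.
    return list(zip(cells[0::2], cells[1::2]))


pairs = pair_cells(SAMPLE_HTML)
for date, company in pairs:
    print(date, company)
```

A list of tuples preserves both 2/14/2014 rows, whereas dict(pairs) would keep only the last company for that date.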