Question

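I am using Python 3.3 and this website: http://www.nasdaq.com/markets/ipos/

My goal is to read only the companies with upcoming IPOs. They are inside a div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.

Here is my code: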

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))

I want it to return only:

3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.

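But it prints every div in the document that matches the re.match pattern, and it prints them multiple times. I tried adding [0] to the for divparent loop to retrieve only the first container, but that is what causes the repetition.

Edit: Here is the updated code based on warunsl's solution. It works.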

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)

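# Take only the first of the two matching containers, then search inside it.
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table = divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    # div.string can be None (e.g. when the div has several children), so guard first.
    if s and re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))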

Answer 1

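You mentioned that two elements match 'class':'genTable thin floatL'. Since you only want the first one, it makes no sense to run a for loop over it.

So replace your outer for loop with: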

divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
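To see why looping over the [0] element repeats the output, here is a minimal sketch. The markup in SAMPLE_HTML is a made-up stand-in, not the real NASDAQ page. Iterating over a BeautifulSoup tag yields its direct children, so the original outer loop ran once per child, and each pass repeated the same whole-document search.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one container div with two direct children.
SAMPLE_HTML = '<div class="genTable thin floatL"><table></table><p>note</p></div>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first = soup.find_all("div", attrs={"class": "genTable thin floatL"})[0]

# Iterating a Tag walks its direct children, not "matching divs".
child_names = [child.name for child in first]
print(child_names)
```

With this sample the loop body would run twice, once for the table and once for the p element.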

Now you do not need to call soup.find_all again. Doing so searches the entire document. You need to limit the search to divparent. So you do this:

table = divparent.find('table')
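The difference in search scope can be sketched as follows. SAMPLE_HTML is an assumed, simplified structure (two containers with one date cell each), and the second date is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two containers sharing the same class.
SAMPLE_HTML = """
<div class="genTable thin floatL">
  <table><tr><td><div class="ipo-cell-height">3/7/2014</div></td></tr></table>
</div>
<div class="genTable thin floatL">
  <table><tr><td><div class="ipo-cell-height">1/2/2014</div></td></tr></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Calling find_all on the soup object searches the whole document.
all_cells = soup.find_all("div", attrs={"class": "ipo-cell-height"})

# Calling find_all on a tag searches only that tag's descendants.
divparent = soup.find_all("div", attrs={"class": "genTable thin floatL"})[0]
table = divparent.find("table")
scoped_cells = table.find_all("div", attrs={"class": "ipo-cell-height"})

whole_document_dates = [str(cell.string) for cell in all_cells]
first_table_dates = [str(cell.string) for cell in scoped_cells]
print(whole_document_dates)
print(first_table_dates)
```

Only the scoped search leaves out the cell from the second container.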

The rest of the code that extracts the dates and company names stays the same, except that it now refers to the table variable.

for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)
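Putting the pieces together, here is a hedged end-to-end sketch that runs against inline sample markup instead of the live page. The function name first_table_listings, the layout of SAMPLE_HTML, and the entry "SOME OTHER COMPANY" are assumptions for illustration. The real page structure may differ.

```python
import re
from bs4 import BeautifulSoup

DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}$")

# Hypothetical markup: each date cell is followed by a company-name cell.
SAMPLE_HTML = """
<div class="genTable thin floatL">
  <table>
    <tr>
      <td><div class="ipo-cell-height">3/7/2014</div></td>
      <td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td>
    </tr>
    <tr>
      <td><div class="ipo-cell-height">2/28/2014</div></td>
      <td><div class="ipo-cell-height">VARONIS SYSTEMS INC</div></td>
    </tr>
  </table>
</div>
<div class="genTable thin floatL">
  <table>
    <tr>
      <td><div class="ipo-cell-height">1/2/2014</div></td>
      <td><div class="ipo-cell-height">SOME OTHER COMPANY</div></td>
    </tr>
  </table>
</div>
"""


def first_table_listings(html):
    """Return (date, company) pairs from the first matching container only."""
    soup = BeautifulSoup(html, "html.parser")
    divparent = soup.find("div", attrs={"class": "genTable thin floatL"})
    if divparent is None:
        return []
    table = divparent.find("table")
    if table is None:
        return []

    listings = []
    for div in table.find_all("div", attrs={"class": "ipo-cell-height"}):
        text = div.string
        # Skip cells without a single text child, and cells that are not dates.
        if not text or not DATE_PATTERN.match(text.strip()):
            continue
        div_next = div.find_next("div")
        if div_next is not None and div_next.string:
            listings.append((text.strip(), str(div_next.string).strip()))
    return listings


listings = first_table_listings(SAMPLE_HTML)
for date, company in listings:
    print("{} - {}".format(date, company))
```

Returning a list of pairs rather than printing inside the loop keeps the parsing step easy to check separately from the output format.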

Hope it helps.

Python web scraping and getting the contents of the first div tag of its class
