Looping through a list of URLs with Beautiful Soup

Date: 2016-12-01 05:44:51

Tags: python web-scraping beautifulsoup

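I am trying to loop through a list of URLs and scrape the board members for a list of companies. There seems to be a problem with my loop: it only runs for the first element of the array and repeats the results. Any help would be appreciated. Code: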

from bs4 import BeautifulSoup
import requests

#array of URLs to loop through, will be larger once I get the loop working correctly
tickers = ['http://www.reuters.com/finance/stocks/companyOfficers?symbol=AAPL.O', 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=GOOG.O']

board_members = []
output = []
soup = BeautifulSoup(html, "html.parser")

for t in tickers:
    html = requests.get(t).text
    officer_table = soup.find('table', {"class" : "dataTable"})
    for row in officer_table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 4:
            board_members.append((t, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))

        for t, name, age, year_joined, position in board_members:
            output.append(('{} {:35} {} {} {}'.format(t, name, age, year_joined, position)))

1 Answer:

Answer 0 (score: 1)

soup = BeautifulSoup(html, "html.parser")

for t in tickers:
    html = requests.get(t).text
    officer_table = soup.find('table', {"class" : "dataTable"})

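You created the soup outside the for loop. That causes an error, because 'html' does not exist yet when BeautifulSoup(html, "html.parser") is called. Just move that line into the loop, after html is assigned, so that each URL gets its own soup: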

for t in tickers:
    html = requests.get(t).text
    soup = BeautifulSoup(html, "html.parser")
    officer_table = soup.find('table', {"class" : "dataTable"})
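
For completeness, here is a minimal sketch of the whole script with the soup built inside the loop. It also moves the formatting step out of the row loop: in the question's code that step is nested inside `for row in ...` and reuses the name `t`, which looks like another source of repeated output. The helper names `parse_officers`, `scrape` and `format_rows`, the `timeout` value and the `None` check are my own additions rather than part of the original code, and the sketch assumes the page still contains a table with class `dataTable`.

```python
from bs4 import BeautifulSoup


def parse_officers(html, source):
    """Return one (source, name, age, year_joined, position) tuple per officer row."""
    # Build a fresh soup from the HTML that was just downloaded.
    soup = BeautifulSoup(html, "html.parser")
    officer_table = soup.find("table", {"class": "dataTable"})
    if officer_table is None:
        # Assumption: treat a missing table as "no officers" instead of crashing.
        return []
    members = []
    for row in officer_table.find_all("tr"):
        cols = [td.text.strip() for td in row.find_all("td")]
        if len(cols) == 4:  # header rows use <th>, so they are skipped here
            members.append((source, *cols))
    return members


def scrape(urls):
    """Download every URL and collect the officer rows from all of them."""
    # Imported here so the parsing helper can be tried without network access.
    import requests

    board_members = []
    for url in urls:
        html = requests.get(url, timeout=10).text
        board_members.extend(parse_officers(html, url))
    return board_members


def format_rows(board_members):
    """Format each row exactly once, after all scraping is finished."""
    return ["{} {:35} {} {} {}".format(*member) for member in board_members]
```

Calling `format_rows(scrape(tickers))` with the `tickers` list from the question should then give one formatted line per officer per company, provided the requests succeed.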