How do I scrape two pages in one nested for-loop and produce two different lists?

Date: 2017-09-21 12:47:34

Tags: python python-3.x web-scraping beautifulsoup

I'm scraping from two URLs that have the same DOM structure, so I'm trying to find a way to scrape both of them at the same time.
The only caveat is that the data scraped from these two pages needs to end up in two distinctly named lists.

To explain by example, here is what I've tried:

import os
import requests
from bs4 import BeautifulSoup as bs


urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
        'https://www.basketball-reference.com/leaders/ws_per_48_career.html']

# request headers referenced in the loop below
headers = {'User-Agent': 'Mozilla/5.0'}

ws_list = []
ws48_list = []

categories = [ws_list, ws48_list]

for url in urls:
    response = requests.get(url, headers=headers)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    for a in section.find_all('a'):
        player_name = a.text
        for cat_list in categories:
            cat_list.append(player_name)
print(ws48_list)
print(ws_list)

Instead of two lists, each unique to its page, I end up printing two identical lists. How do I accomplish this? Would it be better to structure the code differently?

3 Answers:

Answer 0 (score: 3)

Just append each page's names to the corresponding list, and the problem is solved:

for i, url in enumerate(urls):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    for a in section.find_all('a'):
        player_name = a.text
        categories[i].append(player_name)  # i pairs each url with its matching list
print(ws48_list)
print(ws_list)
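The same url-to-list pairing can be written without an index by using `zip`. A minimal sketch with stubbed data (the real version would use the `requests` + BeautifulSoup calls above; the URLs and player names here are placeholders):

```python
urls = ['url_ws', 'url_ws48']  # placeholders for the two real URLs
ws_list, ws48_list = [], []

# stub standing in for the requests.get + BeautifulSoup parsing
fake_names = {'url_ws': ['Kareem Abdul-Jabbar'], 'url_ws48': ['Michael Jordan']}

# zip pairs each url with its own target list, so nothing is double-appended
for url, cat_list in zip(urls, [ws_list, ws48_list]):
    cat_list.extend(fake_names[url])

print(ws_list)    # ['Kareem Abdul-Jabbar']
print(ws48_list)  # ['Michael Jordan']
```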

Answer 1 (score: 3)

Instead of trying to append to pre-existing lists, just create new ones. Write a function that does the scraping and pass each url to it in turn.

import os
import requests
from bs4 import BeautifulSoup as bs

urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
        'https://www.basketball-reference.com/leaders/ws_per_48_career.html']

def parse_page(url, headers=None):
    response = requests.get(url, headers=headers)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    return [a.text for a in section.find_all('a')]


ws_list, ws48_list = [parse_page(url) for url in urls]

print('ws_list = %r' % ws_list)
print('ws48_list = %r' % ws48_list)

Answer 2 (score: 1)

You can use a function to define the scraping logic, then just call it for each of your URLs.

import os
import requests
from bs4 import BeautifulSoup as bs

def scrape(url):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    names = []
    for a in section.find_all('a'):
        player_name = a.text
        names.append(player_name)
    return names

ws_list = scrape('https://www.basketball-reference.com/leaders/ws_career.html')
ws48_list = scrape('https://www.basketball-reference.com/leaders/ws_per_48_career.html')

print(ws_list)
print(ws48_list)
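If more categories get added later, a dict keyed by a short category name scales better than one variable per list. A sketch with a stubbed `scrape` (the real one above hits the network):

```python
def scrape(url):
    # stub standing in for the requests + BeautifulSoup logic above
    return ['names from ' + url]

urls = {
    'ws': 'https://www.basketball-reference.com/leaders/ws_career.html',
    'ws48': 'https://www.basketball-reference.com/leaders/ws_per_48_career.html',
}

# one scrape per category, with all results kept together in a single dict
results = {cat: scrape(url) for cat, url in urls.items()}
```

With this layout, adding a third leaderboard is one more dict entry rather than a new variable and a new call site.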