Web scraping tables from multiple pages with BeautifulSoup

Date: 2021-03-06 00:59:51

Tags: python python-3.x web-scraping beautifulsoup

I am trying to scrape tables for different weeks across multiple pages, but I keep getting the results from this one URL, https://www.boxofficemojo.com/weekly/2018W52/. This is the code I am using:

import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
import re

pages = np.arange(2015,2016)
week = ['01','02','03','04','05','06','07','08','09']
week1 = np.arange(10,11)
for x in week1:
    week.append(x)
week


mov = soup.find_all("table", attrs={"class": "a-bordered"})
print("Number of tables on site: ",len(mov))

all_rows= []
for page in pages:
    for x in week:
        url = requests.get('https://www.boxofficemojo.com/weekly/'+str(page)+'W'+str(x)+'/')
        soup = BeautifulSoup(url.text, 'lxml')
        mov = soup.find_all("table", attrs={"class": "a-bordered"})
        table1 = mov[0]
        body = table1.find_all("tr")
        head = body[0] 
        body_rows = body[1:]
        sleep(randint(2,10))
        for row_num in range(len(body_rows)): 
            row = [] 
            for row_item in body_rows[row_num].find_all("td"): 
                aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
                row.append(aa)
                all_rows.append(row)
                print('Page', page, x)

1 Answer:

Answer 0: (score: 0)

Assuming you want all 52 weeks for each year, why not generate the links upfront, retrieve each table with Pandas, build a list of those dataframes, and concatenate them into a final dataframe?

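A minimal sketch of that approach (it assumes, as the code further down does, that the weekly chart is the first table pd.read_html finds on each page):

import pandas as pd

# Build every weekly URL upfront, read the first table from each page,
# collect the per-week dataframes, and concatenate them at the end.
years = ['2015', '2016']
weeks = [str(i).zfill(2) for i in range(1, 53)]

frames = []
for year in years:
    for week in weeks:
        url = f'https://www.boxofficemojo.com/weekly/{year}W{week}'
        frames.append(pd.read_html(url)[0])

results = pd.concat(frames, ignore_index=True)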

You might then look for ways to speed things up, for example:

import pandas as pd

def get_table(url):
    # The last path segment looks like '2015W01': the year, a 'W', then the week.
    year, week_yr = url.split('/')[-1].split('W')
    df = pd.read_html(url)[0]  # the weekly chart is the first table on the page
    df['year'] = int(year)
    df['week_yr'] = int(week_yr)
    return df

years = ['2015', '2016']
weeks = [str(i).zfill(2) for i in range(1, 53)]
base = 'https://www.boxofficemojo.com/weekly'
urls = [f'{base}/{year}W{week}' for year in years for week in weeks]
results = pd.concat([get_table(url) for url in urls], ignore_index=True)
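
Because get_table takes a single URL, it also maps cleanly over a pool of workers. A sketch using the standard library's concurrent.futures, assuming the site tolerates a handful of concurrent requests (max_workers is an arbitrary choice):

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Fetch the pages concurrently; each worker runs get_table on one URL.
with ThreadPoolExecutor(max_workers=8) as executor:
    frames = list(executor.map(get_table, urls))

results = pd.concat(frames, ignore_index=True)

Threads are usually enough here since the work is I/O-bound; multiprocessing.Pool would work the same way if you preferred processes.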