How do I combine web-scraped data into a single DataFrame?

Asked: 2019-05-02 01:29:51

Tags: python-3.x pandas web-scraping

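I'm trying to pull all of the tables together. I can get the first set of data, which I take to mean the scraping side works, but I run into a problem when I try to combine all of the data.

I tried declaring the DataFrame up front and then having each pass through the loop fill it with that page's table data.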

names = {'Iron-Man',
        'Incredible-Hulk-The',
        'Thor',
        'Iron-Man-2',
        'Captain-America-The-First-Avenger',
        'Avengers-The-(2012)',
        'Iron-Man-3',
        'Thor-The-Dark-World',
        'Captain-America-The-Winter-Soldier',
        'Guardians-of-the-Galaxy',
        'Avengers-Age-of-Ultron',
        'Ant-Man',
        'Captain-America-Civil-War',
        'Doctor-Strange-(2016)',
        'Guardians-of-the-Galaxy-Vol-2',
        'Spider-Man-Homecoming',
        'Thor-Ragnarok',
        'Black-Panther',
        'Avengers-Infinity-War',
        'Ant-Man-and-the-Wasp',
        'Captain-Marvel-(2019)',
        'Avengers-Endgame-(2019)'
         }

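This code works for getting the table from a single page:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Fetch one movie page and parse its HTML
    data = requests.get('https://www.the-numbers.com/movie/Iron-Man#tab=box-office')
    soup = BeautifulSoup(data.text, 'html.parser')

    # The box-office table sits inside the div with id "box_office_chart"
    div = soup.find('div', {'id': 'box_office_chart'})
    table = div.find('table')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html, header=0)  # read_html returns a list of DataFrames
    df = dfs[0]
    df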

I was hoping this code would loop over every movie and grab each table:

for name in names:
    print(name)
    data = requests.get('https://www.the-numbers.com/movie/' + name + '#tab=box-office')
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2
    df.append(df2)
    print(name)
df

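Every movie name prints twice, so I know the loop at least reaches each page. This is the output, which contains none of the other movies.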

df Output:

    Date            Rank    Gross           % Change    Theaters    Per Theaters    Total Gross     Week    movie
0   May 2, 2008     1       $102,118,668    NaN         4105        $24,877         $102,118,668    1       Iron-Man
1   May 9, 2008     1       $51,190,629     -50%        4111        $12,452         $177,825,024    2       Iron-Man
2   May 16, 2008    2       $31,838,996     -38%        4154        $7,665          $223,124,385    3       Iron-Man
3   May 23, 2008    3       $20,447,253     -36%        3915        $5,223          $252,614,669    4       Iron-Man
4   May 30, 2008    4       $13,541,264     -34%        3650        $3,710          $276,166,336    5       Iron-Man
5   Jun 6, 2008     6       $7,477,439      -45%        2931        $2,551          $288,847,640    6       Iron-Man
6   Jun 13, 2008    7       $5,620,375      -25%        2403        $2,339          $297,918,329    7       Iron-Man
7   Jun 20, 2008    9       $4,030,272      -28%        1912        $2,108          $304,816,141    8       Iron-Man
8   Jun 27, 2008    11      $2,257,113      -44%        1379        $1,637          $309,179,318    9       Iron-Man
9   Jul 4, 2008     12      $1,459,613      -35%        1019        $1,432          $311,708,133    10      Iron-Man
10  Jul 11, 2008    14      $939,134        -36%        710         $1,323          $313,421,025    11      Iron-Man
11  Jul 18, 2008    16      $451,838        -52%        375         $1,205          $314,376,968    12      Iron-Man
12  Jul 25, 2008    22      $310,654        -31%        274         $1,134          $314,925,955    13      Iron-Man
13  Aug 1, 2008     16      $580,179        +87%        407         $1,426          $315,687,768    14      Iron-Man
14  Aug 8, 2008     19      $426,502        -26%        45          $1,236          $316,468,817    15      Iron-Man
15  Aug 15, 2008    23      $341,178        -20%        315         $1,083          $317,058,295    16      Iron-Man
16  Aug 22, 2008    29      $243,342        -29%        257         $947            $317,473,452    17      Iron-Man
17  Aug 29, 2008    33      $223,636        -8%         220         $1,017          $317,794,156    18      Iron-Man
18  Sep 5, 2008     38      $126,734        -43%        205         $618            $318,006,770    19      Iron-Man
19  Sep 12, 2008    39      $94,816         -25%        156         $608            $318,134,740    20      Iron-Man
20  Sep 19, 2008    43      $59,037         -38%        124         $476            $318,219,154    21      Iron-Man
21  Sep 26, 2008    48      $58,364         -1%         121         $482            $318,298,180    22      Iron-Man

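I expected all the tables from the other pages to be appended to df. I'm not sure where I'm going wrong.

Edit: I dropped my first attempt at getting the data and just used a bunch of elif statements to create all 22 DataFrames. Thanks for the suggestions, everyone.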

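1 Answer:

Answer 0 (score: 0)

There's no need for all those elif statements. df.append(df2) does not modify df in place; it returns a new DataFrame, which your loop throws away. To append the current df2 from the loop to the final result df, assign the result back: df = df.append(df2)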

import requests
import pandas as pd
from bs4 import BeautifulSoup

names = {'Iron-Man',
        'Incredible-Hulk-The',
        'Thor',
        'Iron-Man-2',
        'Captain-America-The-First-Avenger',
        'Avengers-The-(2012)',
        'Iron-Man-3',
        'Thor-The-Dark-World',
        'Captain-America-The-Winter-Soldier',
        'Guardians-of-the-Galaxy',
        'Avengers-Age-of-Ultron',
        'Ant-Man',
        'Captain-America-Civil-War',
        'Doctor-Strange-(2016)',
        'Guardians-of-the-Galaxy-Vol-2',
        'Spider-Man-Homecoming',
        'Thor-Ragnarok',
        'Black-Panther',
        'Avengers-Infinity-War',
        'Ant-Man-and-the-Wasp',
        'Captain-Marvel-(2019)',
        'Avengers-Endgame-(2019)'
         }

df = pd.DataFrame()
for name in names:
    print(name)
    url = 'https://www.the-numbers.com/movie/' + name + '#tab=box-office'
    data = requests.get(url)
    soup = BeautifulSoup(data.text, 'html.parser')
    # Locate the box-office table and parse it with pandas
    div = soup.find('div', {'id': 'box_office_chart'})
    table = div.find('table')
    dfs = pd.read_html(str(table), header=0)
    df2 = dfs[0]
    df2['movie'] = name  # tag each row with the movie it came from
    # append() returns a new DataFrame, so assign the result back to df.
    # Note: DataFrame.append was removed in pandas 2.0; there, use
    # df = pd.concat([df, df2]) instead.
    df = df.append(df2)
df = df.reset_index(drop=True)

Output:

print (df)
             Date Rank  ... Week                          movie
0     Mar 8, 2019    1  ...    1          Captain-Marvel-(2019)
1    Mar 15, 2019    1  ...    2          Captain-Marvel-(2019)
2    Mar 22, 2019    2  ...    3          Captain-Marvel-(2019)
3    Mar 29, 2019    3  ...    4          Captain-Marvel-(2019)
4     Apr 5, 2019    5  ...    5          Captain-Marvel-(2019)
5    Apr 12, 2019    6  ...    6          Captain-Marvel-(2019)
6    Apr 19, 2019    4  ...    7          Captain-Marvel-(2019)
7    Apr 26, 2019    2  ...    8          Captain-Marvel-(2019)
8    Apr 27, 2018    1  ...    1          Avengers-Infinity-War
9     May 4, 2018    1  ...    2          Avengers-Infinity-War
10   May 11, 2018    1  ...    3          Avengers-Infinity-War
11   May 18, 2018    2  ...    4          Avengers-Infinity-War
12   May 25, 2018    3  ...    5          Avengers-Infinity-War
13    Jun 1, 2018    4  ...    6          Avengers-Infinity-War
14    Jun 8, 2018    5  ...    7          Avengers-Infinity-War
15   Jun 15, 2018    8  ...    8          Avengers-Infinity-War
16   Jun 22, 2018    9  ...    9          Avengers-Infinity-War
17   Jun 29, 2018   12  ...   10          Avengers-Infinity-War
18    Jul 6, 2018   15  ...   11          Avengers-Infinity-War
19   Jul 13, 2018   16  ...   12          Avengers-Infinity-War
20   Jul 20, 2018   20  ...   13          Avengers-Infinity-War
21   Jul 27, 2018   21  ...   14          Avengers-Infinity-War
22    Aug 3, 2018   24  ...   15          Avengers-Infinity-War
23   Aug 10, 2018   26  ...   16          Avengers-Infinity-War
24   Aug 17, 2018   37  ...   17          Avengers-Infinity-War
25   Aug 24, 2018   42  ...   18          Avengers-Infinity-War
26   Aug 31, 2018   44  ...   19          Avengers-Infinity-War
27    Sep 7, 2018   52  ...   20          Avengers-Infinity-War
28   Apr 26, 2019    1  ...    1        Avengers-Endgame-(2019)
29    May 5, 2017    1  ...    1  Guardians-of-the-Galaxy-Vol-2
..            ...  ...  ...  ...                            ...
367   Aug 1, 2008   16  ...   14                       Iron-Man
368   Aug 8, 2008   19  ...   15                       Iron-Man
369  Aug 15, 2008   23  ...   16                       Iron-Man
370  Aug 22, 2008   29  ...   17                       Iron-Man
371  Aug 29, 2008   33  ...   18                       Iron-Man
372   Sep 5, 2008   38  ...   19                       Iron-Man
373  Sep 12, 2008   39  ...   20                       Iron-Man
374  Sep 19, 2008   43  ...   21                       Iron-Man
375  Sep 26, 2008   48  ...   22                       Iron-Man
376   Jul 7, 2017    1  ...    1          Spider-Man-Homecoming
377  Jul 14, 2017    2  ...    2          Spider-Man-Homecoming
378  Jul 21, 2017    3  ...    3          Spider-Man-Homecoming
379  Jul 28, 2017    5  ...    4          Spider-Man-Homecoming
380   Aug 4, 2017    6  ...    5          Spider-Man-Homecoming
381  Aug 11, 2017    7  ...    6          Spider-Man-Homecoming
382  Aug 18, 2017    7  ...    7          Spider-Man-Homecoming
383  Aug 25, 2017    7  ...    8          Spider-Man-Homecoming
384   Sep 1, 2017    7  ...    9          Spider-Man-Homecoming
385   Sep 8, 2017    7  ...   10          Spider-Man-Homecoming
386  Sep 15, 2017    9  ...   11          Spider-Man-Homecoming
387  Sep 22, 2017   11  ...   12          Spider-Man-Homecoming
388  Sep 29, 2017   18  ...   13          Spider-Man-Homecoming
389   Oct 6, 2017   20  ...   14          Spider-Man-Homecoming
390  Oct 13, 2017   20  ...   15          Spider-Man-Homecoming
391  Oct 20, 2017   27  ...   16          Spider-Man-Homecoming
392  Oct 27, 2017   33  ...   17          Spider-Man-Homecoming
393   Nov 3, 2017   37  ...   18          Spider-Man-Homecoming
394  Nov 10, 2017   42  ...   19          Spider-Man-Homecoming
395  Nov 17, 2017   46  ...   20          Spider-Man-Homecoming
396  Nov 24, 2017   51  ...   21          Spider-Man-Homecoming

[397 rows x 9 columns]
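
As a variation on the loop above, and not part of the original answer: a common pattern is to collect each page's table in a list and call pd.concat once at the end. This also works on pandas 2.0 and later, where DataFrame.append no longer exists. The sketch below is a minimal illustration under assumptions: the function name combine_tables and the dict-of-DataFrames input are hypothetical, and two tiny hand-built tables (with a few values taken from the output shown earlier) stand in for the live requests so the example runs offline.

```python
import pandas as pd


def combine_tables(tables):
    """Stack per-movie tables into one DataFrame.

    `tables` maps a movie name to the DataFrame scraped for that movie
    (a hypothetical structure, used here only for illustration).
    """
    frames = []
    for name, table in tables.items():
        # assign() returns a copy with the extra column, so the
        # caller's DataFrame is left untouched.
        frames.append(table.assign(movie=name))
    if not frames:
        return pd.DataFrame()
    # A single concat at the end avoids re-copying the growing
    # result on every iteration.
    return pd.concat(frames, ignore_index=True)


# Stand-in tables instead of live requests.
tables = {
    'Iron-Man': pd.DataFrame({'Date': ['May 2, 2008', 'May 9, 2008'],
                              'Week': [1, 2]}),
    'Captain-Marvel-(2019)': pd.DataFrame({'Date': ['Mar 8, 2019'],
                                           'Week': [1]}),
}
combined = combine_tables(tables)
print(combined)
```

In the scraping loop, the equivalent would be to store dfs[0] under tables[name] on each iteration and call combine_tables(tables) after the loop. Note also that names is a set, so the iteration order is arbitrary, which is why the movies in the output above are not in release order; a list would preserve the order in which the names are written.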