Extracting table data with BeautifulSoup

Time: 2019-01-22 02:20:20

Tags: beautifulsoup

I'm having some trouble extracting data (zip code and population) with BeautifulSoup. Any help is appreciated.

import pandas as pd    
import numpy as np    
import requests    
from bs4 import BeautifulSoup    

pop_source = requests.get("https://www.zip-codes.com/city/tx-austin.asp").text

soup = BeautifulSoup(pop_source, 'html5lib')    
zip_pop_table = soup.find('table',class_='statTable')    

austin_pop = pd.DataFrame(columns=['Zip Code','Population'])    

for row in zip_pop_table.find_all('tr'):    
    cols = row.find_all('td') 

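Now I'm stuck. I really don't know how to pull the data out of the columns I need and append it to the columns of the empty DataFrame I created.

Any help is appreciated.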

1 answer:

Answer 0 (score: 0)

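You just need to iterate over cols and dump the values into the austin_pop DataFrame.

I did that by using a list comprehension to build a list of the cell text in cols: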

row_list = [ data.text for data in cols ]

The list comprehension is equivalent to a for loop, so you could use this instead:

row_list = []
for data in cols:
    row_list.append(data.text)

Create a one-row DataFrame, keep only the 2 columns you need, and then append it to austin_pop:

temp_df = pd.DataFrame([row_list], columns = ['Zip Code','type','county','Population', 'area_codes'])
temp_df = temp_df[['Zip Code', 'Population']]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
austin_pop = pd.concat([austin_pop, temp_df], ignore_index=True)
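Growing a DataFrame one row at a time copies the whole frame on every iteration. As an alternative, here is a minimal sketch that collects the rows in a plain list and builds the DataFrame once at the end. The markup in `sample_html` is made up for illustration and only assumes the same five-column layout as the real table; the names `sample_html`, `records`, and `sample_pop` are hypothetical, and `html.parser` is used instead of `html5lib` so the sketch needs no extra parser package.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the assumed layout of the real table.
sample_html = """
<table class="statTable">
  <tr><td>ZIP Code</td><td>Type</td><td>County</td><td>Population</td><td>Area Code(s)</td></tr>
  <tr><td>ZIP Code 78701</td><td>Standard</td><td>Travis</td><td>6,841</td><td>512</td></tr>
  <tr><td>ZIP Code 78702</td><td>Standard</td><td>Travis</td><td>21,334</td><td>512</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="statTable")

records = []
for row in table.find_all("tr")[1:]:  # skip the first (header) row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) < 4:
        continue  # ignore rows that do not have the expected cells
    records.append({
        "Zip Code": cells[0].split()[-1],  # keep only the trailing zip code token
        "Population": cells[3],
    })

# Build the DataFrame once, after all rows have been collected.
sample_pop = pd.DataFrame(records, columns=["Zip Code", "Population"])
print(sample_pop)
```

Against the real page, the same loop should work on zip_pop_table, provided the column positions match the assumption above.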

Full code:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup


url = "https://www.zip-codes.com/city/tx-austin.asp"
pop_source = requests.get(url).text

soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table', class_='statTable')

austin_pop = pd.DataFrame(columns=['Zip Code', 'Population'])

for row in zip_pop_table.find_all('tr'):
    cols = row.find_all('td')
    row_list = [data.text for data in cols]

    # Build a one-row frame with all five table columns, then keep the two we need
    temp_df = pd.DataFrame([row_list], columns=['Zip Code', 'type', 'county', 'Population', 'area_codes'])
    temp_df = temp_df[['Zip Code', 'Population']]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    austin_pop = pd.concat([austin_pop, temp_df], ignore_index=True)


# Drop the first row (presumably the table's header row)
austin_pop = austin_pop.iloc[1:, :].copy()
# Keep only the last whitespace-separated token of each cell (the zip code itself)
austin_pop['Zip Code'] = austin_pop['Zip Code'].apply(lambda x: x.split()[-1])

Output:

print (austin_pop)
   Zip Code Population
1     73301          0
2     73344          0
3     78681     50,606
4     78701      6,841
5     78702     21,334
6     78703     19,690
7     78704     42,117
8     78705     31,340
9     78708          0
10    78709          0
11    78710          0
12    78711          0
13    78712        860
14    78713          0
15    78714          0
16    78715          0
17    78716          0
18    78717     22,538
19    78718          0
20    78719      1,764
21    78720          0
22    78721     11,425
23    78722      5,901
24    78723     28,330
25    78724     21,696
26    78725      6,083
27    78726     13,122
28    78727     26,689
29    78728     20,299
30    78729     27,108
..      ...        ...
45    78746     26,928
46    78747     14,808
47    78748     40,651
48    78749     34,449
49    78750     26,814
50    78751     14,385
51    78752     18,064
52    78753     49,301
53    78754     15,036
54    78755          0
55    78756      7,194
56    78757     21,310
57    78758     44,072
58    78759     38,891
59    78760          0
60    78761          0
61    78762          0
62    78763          0
63    78764          0
64    78765          0
65    78766          0
66    78767          0
67    78768          0
68    78772          0
69    78773          0
70    78774          0
71    78778          0
72    78779          0
73    78783          0
74    78799          0

[74 rows x 2 columns]
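Note that the Population values above are still strings with thousands separators (for example "50,606"). If numeric values are needed, one possible follow-up step is sketched below. The `sample` frame is a hypothetical three-row excerpt of the output, not part of the original answer.

```python
import pandas as pd

# Hypothetical excerpt of the scraped frame, with populations stored as strings.
sample = pd.DataFrame({
    "Zip Code": ["73301", "78681", "78712"],
    "Population": ["0", "50,606", "860"],
})

# Remove the thousands separators, then convert the column to integers.
sample["Population"] = (
    sample["Population"].str.replace(",", "", regex=False).astype(int)
)
print(sample.dtypes)
```

The same two calls could be applied to austin_pop['Population'], assuming every cell holds only digits and commas.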