使用BeautifulSoup提取数据(邮政编码和填充)时会遇到一些麻烦。任何帮助表示赞赏。
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
pop_source = requests.get("https://www.zip-codes.com/city/tx-austin.asp").text
soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table',class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code','Population'])
for row in zip_pop_table.find_all('tr'):
cols = row.find_all('td')
现在,我被卡住了。真的不知道如何在所需的列中提取数据并将其追加到在空数据框中创建的列中。
任何帮助表示赞赏。
答案 0 :(得分:0)
您只需要遍历cols
,并将其转储到austin_pop
数据框中。
因此,我通过使用列表理解来列出cols
中的数据列表来做到这一点:
row_list = [ data.text for data in cols ]
等同于for
循环的列表理解。您可以使用。
row_list = []
for data in cols:
rows_list.append(data.text)
创建一行,保留所需的2列,然后将其转储到austin_pop
:
temp_df = pd.DataFrame([row_list], columns = ['Zip Code','type','county','Population', 'area_codes'])
temp_df = temp_df[['Zip Code', 'Population']]
austin_pop = austin_pop.append(temp_df).reset_index(drop = True)
完整代码:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.zip-codes.com/city/tx-austin.asp"
pop_source = requests.get("https://www.zip-codes.com/city/tx-austin.asp").text
soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table',class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code','Population'])
for row in zip_pop_table.find_all('tr'):
cols = row.find_all('td')
row_list = [ data.text for data in cols ]
temp_df = pd.DataFrame([row_list], columns = ['Zip Code','type','county','Population', 'area_codes'])
temp_df = temp_df[['Zip Code', 'Population']]
austin_pop = austin_pop.append(temp_df).reset_index(drop = True)
austin_pop = austin_pop.iloc[1:, :]
austin_pop['Zip Code'] = austin_pop['Zip Code'].apply(lambda x: x.split()[-1])
输出:
print (austin_pop)
Zip Code Population
1 73301 0
2 73344 0
3 78681 50,606
4 78701 6,841
5 78702 21,334
6 78703 19,690
7 78704 42,117
8 78705 31,340
9 78708 0
10 78709 0
11 78710 0
12 78711 0
13 78712 860
14 78713 0
15 78714 0
16 78715 0
17 78716 0
18 78717 22,538
19 78718 0
20 78719 1,764
21 78720 0
22 78721 11,425
23 78722 5,901
24 78723 28,330
25 78724 21,696
26 78725 6,083
27 78726 13,122
28 78727 26,689
29 78728 20,299
30 78729 27,108
.. ... ...
45 78746 26,928
46 78747 14,808
47 78748 40,651
48 78749 34,449
49 78750 26,814
50 78751 14,385
51 78752 18,064
52 78753 49,301
53 78754 15,036
54 78755 0
55 78756 7,194
56 78757 21,310
57 78758 44,072
58 78759 38,891
59 78760 0
60 78761 0
61 78762 0
62 78763 0
63 78764 0
64 78765 0
65 78766 0
66 78767 0
67 78768 0
68 78772 0
69 78773 0
70 78774 0
71 78778 0
72 78779 0
73 78783 0
74 78799 0
[74 rows x 2 columns]