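I'm building a web scraper with Python, BeautifulSoup, pandas, and Google Sheets. So far I've managed to scrape data tables from several web pages whose URLs are listed in a Google Sheet. What I want is a separate DataFrame for each table at each URL.
Here is my code so far: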
import gspread
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
import pandas as pd
from bs4 import BeautifulSoup
import requests

credentials = service_account.Credentials.from_service_account_file(
    'credentials.json')
scoped_credentials = credentials.with_scopes(
    ['https://spreadsheets.google.com/feeds',
     'https://www.googleapis.com/auth/drive']
)
gc = gspread.Client(auth=scoped_credentials)
gc.session = AuthorizedSession(scoped_credentials)
sheet = gc.open_by_key('api_key')
worksheet = sheet.sheet1
link_list = worksheet.col_values(2)

def get_info(page_url):
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    try:
        tbl = soup.find('table')
        labels = []
        data = []
        for tr in tbl.findAll('tr'):
            imp_labels = [th.text.strip() for th in tr.findAll('th')]
            imp_data = [td.text.strip() for td in tr.findAll('td')]
            labels.append(imp_labels)
            data.append(imp_data)
        col_names = {'Labels': imp_labels, 'Data': imp_data}
        df = pd.DataFrame([labels, data], col_names)
        df_t = df.T
        print(df_t)
    except Exception as e:
        print(e)

for link in link_list:
    get_info(link)
Output:
Labels Data
0 [Celebrated Name] [Don Lemon]
1 [Age] [54 Years]
2 [Nick Name] [Don Lemon]
3 [Birth Name] [Don Lemon]
4 [Birth Date] [1966-03-01]
5 [Gender] [Male]
6 [Profession] [Journalist]
7 [Birth Nation] [United States]
8 [Place Of Birth] [Baton Rouge, Louisiana, United States]
9 [Nationality] [American]
10 [Siblings] [Leisa Lemon, Yma Lemon]
11 [Ethnicity] [Mixed]
12 [Eye Color] [Brown]
13 [Hair Color] [Black]
14 [Religion] [Christian]
15 [Height] [5 Feet 6 Inch]
16 [Weight] [Not Known]
17 [Working For] [CNN]
18 [Best Known For] [CNN Tonight]
19 [School] [Baker High School]
20 [College / University] [Brookyn College]
21 [University] [Louisiana State University]
22 [Horoscope] [Pisces]
23 [Net Worth] [$ 3 million (As of 2018)]
24 [Famous For] [For hosting the program ‘CNN Tonight’]
25 [Body Measurement] [40-32-35]
26 [Awards] [Emmy Award]
27 [Salary] [$125000]
28 [Links] [WikipediaFacebookTwitterInstagram]
Labels Data
0 [Celebrated Name] [2 Chainz]
1 [Age] [43 Years]
2 [Nick Name] [Tity Boi, Drenchgod]
3 [Birth Name] [Tauheed Epps]
4 [Birth Date] [1977-09-12]
5 [Gender] [Male]
6 [Profession] [Rapper]
7 [Place Of Birth] [College Park, Georgia, United States]
8 [Nationality] [American]
9 [Ethnicity] [Afro-American]
10 [Horoscope] [Virgo]
11 [High School] [North Clayton High School]
12 [University] [Alabama State University and Virginia State U...
13 [Marital Status] [Married]
14 [Wife] [Kesha Ward]
15 [Children] [Heaven, Harmony, and Halo]
16 [Body Build/Type] [Athletic]
17 [Body Measurement] [43-15-34 inches]
18 [Chest Size] [43 inches]
19 [Bicep Size] [15 inches]
20 [Waist Size] [34 inches]
21 [Shoe Size] [14 (US]
22 [Height] [6 feet 5 inches]
23 [Weight] [88 kg]
24 [Net Worth] [$ 6 Million]
25 [Salary] [$ 100,000]
26 [Sexual Orientation] [Straight]
27 [Eye Color] [Dark Brown]
28 [Hair Color] [Black]
29 [Links] [Wikipedia,Instagram,Twitter,Facebook]
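A note on the brackets in the output above: each row appends a whole list (imp_labels, imp_data) rather than a string, so every cell ends up holding a one-element list. A minimal sketch of a flat version, assuming each row has at most a few th/td cells that can simply be joined; the sample rows and the name flat are hypothetical and stand in for what get_info collects:

```python
import pandas as pd

# Hypothetical rows, shaped like the per-row lists collected in get_info.
labels = [["Celebrated Name"], ["Age"], ["Gender"]]
data = [["Don Lemon"], ["54 Years"], ["Male"]]

# Join each per-row list into one string so the cells hold plain text
# instead of one-element lists.
flat = pd.DataFrame({
    "Labels": [" ".join(cells) for cells in labels],
    "Data": [" ".join(cells) for cells in data],
})
print(flat)
```

Building the frame from a dict of columns also avoids the transpose step.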
So, my question is:
I'm new to Python, so I apologize if this is a bit messy. Thanks in advance.
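
For reference, the goal stated at the top (one DataFrame per table, per URL) could be sketched roughly as below. This is only an illustration under assumptions: the helper name tables_to_frames, the sample_html string, and the example.com URL are all hypothetical, and the inline HTML stands in for page.content so the sketch runs without network access.

```python
import pandas as pd
from bs4 import BeautifulSoup


def tables_to_frames(html):
    """Return one two-column DataFrame per <table> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for tbl in soup.find_all("table"):
        rows = []
        for tr in tbl.find_all("tr"):
            label = " ".join(th.get_text(strip=True) for th in tr.find_all("th"))
            value = " ".join(td.get_text(strip=True) for td in tr.find_all("td"))
            if label or value:  # skip rows that carry no th/td text
                rows.append({"Labels": label, "Data": value})
        frames.append(pd.DataFrame(rows, columns=["Labels", "Data"]))
    return frames


# Hypothetical stand-in for page.content.
sample_html = """
<table>
  <tr><th>Celebrated Name</th><td>Don Lemon</td></tr>
  <tr><th>Age</th><td>54 Years</td></tr>
</table>
<table>
  <tr><th>Profession</th><td>Journalist</td></tr>
</table>
"""

# Keyed by URL so each page keeps its own list of DataFrames.
frames_by_url = {"https://example.com/page": tables_to_frames(sample_html)}
for url, frames in frames_by_url.items():
    for i, frame in enumerate(frames):
        print(url, "table", i)
        print(frame)
```

In the real script, the dict would be filled inside the loop over link_list, passing requests.get(link).content to the helper instead of sample_html.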