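I'm new to Python and bs4. I've been trying for several hours to scrape tables from several web pages using Beautiful Soup and pandas. It works when I scrape 2 pages, but I run into trouble when I try to scrape all 13. When I change the range() argument from 2 to 13, the code no longer produces a DataFrame or a CSV file. What am I doing wrong?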
dfs=[]
for page in range(13):
    http = "http://websitexample/Records?year=2020&page={}".format(page+1)
    url = requests.get(http)
    soup = BeautifulSoup(url.text, "lxml")
    table = soup.find('table')
    df_list = pd.read_html(url.text)
    df = pd.concat(df_list)
    links = []
    for tr in table.find_all("tr"):
        trs = tr.find_all("td")
        for each in trs:
            try:
                link = each.find('a')['href']
                links.append(link)
            except:
                pass
    df['Link']= links
    dfs.append(df)
final_df = pd.concat(dfs)
final_df.to_csv("NewFileAll13.csv",index=False,encoding='utf-8-sig')
I get this error message:
ValueError: Length of values does not match length of index
I'd really appreciate any advice you can offer. Thanks!
Answer 0 (score: 1)
To download all the data and links from all pages, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://reactwarn.floridajobs.org/WarnList/Records'
params = {
    'year': 2020,
    'page': 1
}

all_data = []
# The loop variable is params['page'], so each request asks for the next page (1 to 13).
for params['page'] in range(1, 14):
    print('Page {}..'.format(params['page']))
    soup = BeautifulSoup(requests.get(url, params=params).content, 'lxml')
    for row in soup.select('tbody tr'):
        # Text of every cell except the last one, which holds the attachment link.
        tds = [td.get_text(strip=True, separator='\n') for td in row.select('td')][:-1]
        # Always add exactly one link entry per row; use '' when the row has no <a>.
        tds.append(row.a['href'] if row.a else '')
        all_data.append(tds)

df = pd.DataFrame(all_data, columns=['Company Name', 'State Notification Date', 'Layoff Date', 'Employees Affected', 'Industry', 'Attachment'])
print(df)
df.to_csv('data.csv', index=False)
Prints:
Company Name ... Attachment
0 TrueCore Behavioral Solutions\n5050 N.E. 168th... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
1 Cuba Libre Orlando, LLC t/a Cuba Libre Restaur... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
2 Hyatt Regency Orlando\n9801 International Dr.O... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
3 ABM. Inc.\nNova Southeastern University3301 Co... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
4 Newport Beachside Resort\n16701 Collins Avenue... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
... ... ... ...
1251 P.F. Chang's China Bistro\n3597 S.W. 32nd Ct.,... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
1252 P.F. Chang's China Bistro\n11361 N.W. 12th St.... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
1253 P.F. Chang's China Bistro\n8888 S.W. 136th St.... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
1254 P.F. Chang's China Bistro\n17455 Biscayne Blvd... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
1255 Grand Hyatt Tampa Bay\n2900 Bayport DriveTAMPA... ... /WarnList/Download?file=%5C%5Cdeo-wpdb005%5CRe...
[1256 rows x 6 columns]
and saves data.csv.
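A likely reason for the ValueError in the question is that the links are collected separately from the table rows: any row without an `<a>` tag is skipped, so the `links` list can end up shorter than the DataFrame on some pages. The sketch below reproduces that on a small made-up HTML snippet (the company names and file paths are hypothetical, not taken from the real site) and then shows the row-wise approach, which adds exactly one entry per row. It uses the built-in "html.parser" so it runs without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample page: the second row has no <a> tag.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Company Name</th><th>Attachment</th></tr></thead>
  <tbody>
    <tr><td>Acme</td><td><a href="/files/acme.pdf">PDF</a></td></tr>
    <tr><td>Globex</td><td></td></tr>
    <tr><td>Initech</td><td><a href="/files/initech.pdf">PDF</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Approach from the question: links are gathered independently of the rows,
# so the row without a link contributes nothing to the list.
links = [a["href"] for a in soup.find_all("a")]
df = pd.DataFrame({"Company Name": ["Acme", "Globex", "Initech"]})

mismatch_error = None
try:
    df["Link"] = links  # 2 links assigned to 3 rows
except ValueError as exc:
    mismatch_error = exc
print("Assignment failed:", mismatch_error)

# Row-wise approach: build each record from a single <tr>, using an empty
# string as a placeholder when the row has no link.
rows = []
for tr in soup.select("tbody tr"):
    name = tr.find("td").get_text(strip=True)
    link = tr.a["href"] if tr.a else ""
    rows.append([name, link])

fixed = pd.DataFrame(rows, columns=["Company Name", "Link"])
print(fixed)
```

Because every row yields one link value, the column length always matches the number of rows, no matter which page is being scraped.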