I'm just trying to extract the first column of URLs from the table on this site, and I keep getting KeyError: 0. I'm just starting to learn Python.
Traceback (most recent call last):
File "riscribble.py", line 13, in <module>
lic_link = soup_data[0].find('a').text
File "C:\Users\rkrouse\Desktop\Python\lib\site-packages\bs4\element.py", line 1071, in __getitem__
return self.attrs[key]
KeyError: 0
Any ideas on why I'm getting this error and/or how to fix it would be greatly appreciated.
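For context on the traceback: indexing a BeautifulSoup Tag with square brackets looks up an HTML *attribute*, not a positional child, so soup_data[0] asks the table tag for an attribute named 0, which does not exist. A minimal sketch on a tiny standalone document (hypothetical HTML, not the real page):

```python
from bs4 import BeautifulSoup

# A tiny standalone document, not the real crb.state.ri.us page
html = '<table class="results"><tr><td><a href="licensedetail.php?link=1">1</a></td></tr></table>'
table = BeautifulSoup(html, 'html.parser').find('table')

print(table['class'])  # indexing a Tag reads its HTML attributes: ['results']

try:
    table[0]           # there is no attribute named 0...
except KeyError as e:
    print('KeyError:', e)  # ...so this raises KeyError: 0, as in the traceback
```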
from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
url = 'http://www.crb.state.ri.us/verify_CRB.php?page=0&letter='
data = r.get(url)
page_data = soup(data.text, 'html.parser')
soup_data = page_data.find('table')
lic_link = soup_data[0].find('a').text
df = pd.DataFrame()
for each in soup_data:
    lic_link = each.find('a').text
    df = df.append(pd.DataFrame({'LicenseURL': lic_link}, index=[0]))
df.to_csv('RI_License_urls.csv', index=False)
Answer 0 (score: 1)
Change soup_data = page_data.find('table') to soup_data = page_data.find_all('table'). find returns only the first matching element, while find_all returns all of them. See here for more information.
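The behavioral difference is easy to see on a tiny standalone document (hypothetical HTML, not the real page): find returns a single Tag, while find_all returns a list-like ResultSet you can loop over.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table document, just to contrast find and find_all
html = '<table><a href="a.php">first</a></table><table><a href="b.php">second</a></table>'
page = BeautifulSoup(html, 'html.parser')

first = page.find('table')      # a single Tag (or None if nothing matches)
every = page.find_all('table')  # always a ResultSet, even for a single hit

print(first.find('a').text)                # first
print([t.find('a').text for t in every])   # ['first', 'second']
```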
from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
url = 'http://www.crb.state.ri.us/verify_CRB.php?page=0&letter='
data = r.get(url)
page_data = soup(data.text, 'html.parser')
soup_data = page_data.find_all('table')
df = pd.DataFrame()
for each in soup_data:
    lic_link = each.find('a').text
    df = df.append(pd.DataFrame({'LicenseURL': lic_link}, index=[0]))
df.to_csv('RI_License_urls.csv', index=False)
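One caveat with the loop above: DataFrame.append was deprecated and removed in pandas 2.0, and growing a DataFrame row by row is slow anyway. A sketch of an equivalent approach that collects the rows first and builds the frame once (the example link texts are stand-ins for each.find('a').text; the column name is kept from the answer):

```python
import pandas as pd

link_texts = ['32922', '32923']  # stand-in for the scraped each.find('a').text values
rows = [{'LicenseURL': t} for t in link_texts if t is not None]

df = pd.DataFrame(rows)          # build the frame once instead of appending in a loop
print(df['LicenseURL'].tolist())  # ['32922', '32923']
```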
Answer 1 (score: 1)
Imports:
from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
import re
Fetch the page:
url = 'http://www.crb.state.ri.us/verify_CRB.php?page=0&letter='
data = r.get(url)
page_data = soup(data.text, 'html.parser')
Select the links:
links = [link.text for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# each element -> '32922'

# or
links = [link for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# each element -> <a href="licensedetail.php?link=32922&type=Resid">32922</a>

# or
links = [link['href'] for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# each element -> 'licensedetail.php?link=32922&type=Resid'

# or
links = [r'www.crb.state.ri.us/' + link['href'] for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# each element -> 'www.crb.state.ri.us/licensedetail.php?link=32922&type=Resid'
Putting it together:
df = pd.DataFrame(links, columns=['LicenseURL'])
df.to_csv('RI_License_urls.csv', index=False)
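A small robustness note on the last variant above: prefixing the href with a string drops the http:// scheme and would break on hrefs that are already absolute. The standard library's urllib.parse.urljoin handles both cases. A sketch using one of the sample hrefs from the answer:

```python
from urllib.parse import urljoin

base = 'http://www.crb.state.ri.us/verify_CRB.php'
href = 'licensedetail.php?link=32922&type=Resid'  # relative href as scraped from the table

# urljoin resolves the relative href against the page URL, keeping the scheme
print(urljoin(base, href))
# http://www.crb.state.ri.us/licensedetail.php?link=32922&type=Resid
```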
Remember to tick the check mark next to the solution.