Hello, I want to scrape data from multiple URLs, and this is how I am doing it:

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

But it doesn't give me the complete data; it only prints the data from the last URL.
Here is my code, please help:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2
import operator

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)
    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    page1_soup = soup(page1_html, 'html.parser')
    # grabbing data
    containers = page1_soup.findAll('div', {'class': 'PA15'})
    # Make the connection to PostgreSQL
    conn = psycopg2.connect(database='--', user='--', password='--', port=--)
    cursor = conn.cursor()
    for container in containers:
        toll_name1 = container.p.b.text
        toll_name = toll_name1.split(" ")[1]
        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]
        text = search1[1].get_text()
        onset = text.index('in')
        offset = text.index('Stretch')
        state = str(text[onset + 2:offset]).strip(' ')
        location = list(container.p.descendants)[10]
        mystr = my_url[my_url.find('?'):]
        TID = mystr.strip('?TollPlazaID=')
        query = "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) VALUES (%s, %s, %s, %s, %s);"
        data = (TID, toll_name, location, highway_number, state)
        cursor.execute(query, data)
# Commit the transaction
conn.commit()
But it only shows the data from the second-to-last URL.
Answer (score: 1)
It seems that some pages are missing the key information you need; you can use error catching for that, for example:
try:
    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no items were scraped
You may want to add some logging/print statements to keep track of the tables that don't exist.
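To make that concrete, here is a minimal sketch (not part of the original answer) of how the same skip-and-track pattern could be applied to the question's own PA15 selector; the print message and the choice to take only the first PA15 block are illustrative assumptions:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)
    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    page1_soup = soup(page1_html, 'html.parser')

    try:
        # Take the first toll-plaza block; an IndexError means this page has none.
        container = page1_soup.findAll('div', {'class': 'PA15'})[0]
    except IndexError:
        print('No toll information for TollPlazaID={}'.format(i))  # track missing pages
        continue  # skip this page

    # ... parse `container` and insert into the database as before ...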
EDIT:
It only shows information from the last page because you are committing your transaction outside the for loop and overwriting conn for every i. Just put conn.commit() inside the for loop, at its very end.
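For reference, here is a minimal sketch of the corrected structure under the answer's advice: the connection is opened once before the loop (since the answer notes conn was being overwritten for every i), and conn.commit() is moved to the end of each iteration. The credentials stay redacted as in the question, and the parsing logic is condensed from the question's code:

import psycopg2
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Open the connection once, before the loop, so it is not overwritten for every i.
# Database credentials are placeholders, redacted as in the question.
conn = psycopg2.connect(database='--', user='--', password='--', port='--')
cursor = conn.cursor()

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)
    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    page1_soup = soup(page1_html, 'html.parser')
    containers = page1_soup.findAll('div', {'class': 'PA15'})

    for container in containers:
        toll_name = container.p.b.text.split(" ")[1]
        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]
        text = search1[1].get_text()
        state = text[text.index('in') + 2:text.index('Stretch')].strip(' ')
        location = list(container.p.descendants)[10]
        TID = my_url[my_url.find('?'):].strip('?TollPlazaID=')
        cursor.execute(
            "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) "
            "VALUES (%s, %s, %s, %s, %s);",
            (TID, toll_name, location, highway_number, state),
        )

    # Commit inside the for loop, at its very end, as the answer suggests,
    # so every page's rows are actually saved.
    conn.commit()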