How to use Python

Time: 2017-09-04 10:32:48

Tags: python web-scraping beautifulsoup

Hello, I want to scrape data from multiple URLs. This is how I am doing it:

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

But it does not give me the complete data; it only prints the data for the last URL.

Here is my code, please help:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2
import operator


for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    page1_soup = soup(page1_html, 'html.parser')

    # grabbing data
    containers = page1_soup.findAll('div', {'class': 'PA15'})

    # Make the connection to PostgreSQL
    conn = psycopg2.connect(database='--',user='--', password='--', port=--)
    cursor = conn.cursor()
    for container in containers:
        toll_name1 = container.p.b.text
        toll_name = toll_name1.split(" ")[1]

        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]

        text = search1[1].get_text()
        onset = text.index('in')
        offset = text.index('Stretch')
        state = str(text[onset +2:offset]).strip(' ')

        location = list(container.p.descendants)[10]
        mystr = my_url[my_url.find('?'):]
        TID = mystr.strip('?TollPlazaID=')

        query = "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) VALUES (%s, %s, %s, %s, %s);"
        data = (TID, toll_name, location, highway_number, state)

        cursor.execute(query, data)

# Commit the transaction
conn.commit()

But it only shows the data for the second-to-last URL.

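1 answer:

Answer 0 (score: 1)

It seems that some of the pages are missing the key information you are looking for. You can catch the resulting error, for example: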

try:
    # `soup` is assumed here to be the parsed page (page1_soup in the question's code)
    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no items were scraped

You may want to add some logging or print statements to keep track of the pages whose table does not exist.
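One way to keep track of the skipped pages is sketched below. It uses only the standard library so that it runs without network access: plain lists stand in for the result of the BeautifulSoup lookup, and the logger name, the helper data_rows and the sample pages are all hypothetical names made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("toll_scraper")  # hypothetical logger name


def data_rows(tables, plaza_id):
    """Return the rows after the header of the first table, or None if there is no table.

    `tables` stands in for the list returned by the table lookup on the
    parsed page; each table is modelled here as a plain list of rows.
    """
    try:
        return tables[0][1:]
    except IndexError:
        # No table on this page: record which ID was affected
        log.warning("TollPlazaID=%s: no table found, skipping page", plaza_id)
        return None


# Hypothetical parsed results keyed by TollPlazaID; page 2 has no table.
pages = {1: [["header", "row a", "row b"]], 2: [], 3: [["header", "row c"]]}

skipped = []
for plaza_id, tables in pages.items():
    rows = data_rows(tables, plaza_id)
    if rows is None:
        skipped.append(plaza_id)
        continue  # Skip this page, as in the snippet above
    log.info("TollPlazaID=%s: %d data rows", plaza_id, len(rows))

log.info("Skipped pages: %s", skipped)
```

In the real scraper the same try/except would wrap the lookup on the parsed page, and the list of skipped IDs could be reviewed after the run.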

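Edit: It only shows the information from the last page because you commit your transaction outside the for loop while overwriting conn on every i, so only the last connection is ever committed. Just put conn.commit() inside the for loop, at the very end of the loop body.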
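A minimal sketch of committing once per page follows. It is not the asker's actual setup: an in-memory sqlite3 database stands in for PostgreSQL so the example can run anywhere, scrape_page is a hypothetical placeholder for the download and parsing step, and the table has only two of the original columns. The sketch also opens the connection once before the loop, which is an extra choice on top of what the edit suggests.

```python
import sqlite3

# Assumption: sqlite3 (in-memory) stands in for the psycopg2 connection.
# The connection is opened once, before the loop, instead of once per page.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE tollmaster (TID INTEGER, toll_name TEXT)")


def scrape_page(plaza_id):
    """Hypothetical stand-in for downloading and parsing one page."""
    return [(plaza_id, "plaza-{}".format(plaza_id))]


for i in range(5):
    for row in scrape_page(i):
        # sqlite3 uses ? placeholders; psycopg2 uses %s as in the question
        cursor.execute(
            "INSERT INTO tollmaster (TID, toll_name) VALUES (?, ?)", row
        )
    # Commit at the end of every iteration so each page's rows are saved
    conn.commit()

saved = cursor.execute("SELECT COUNT(*) FROM tollmaster").fetchone()[0]
print(saved)
conn.close()
```

Committing per page means a failure on a later page does not discard the rows already stored for earlier pages.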