嵌套的循环不断重复

时间:2019-06-17 08:29:14

标签: python web-scraping beautifulsoup python-requests

我有一个主要的python scraper

从文本到数组读取邮政编码列表

对于数组中的每个邮政编码 搜索10页 提取某些内容。

我似乎正在得到如下结果: 第1页 第2页 第2页 第3页 第3页 第3页 第4页 第4页 第4页 第4页

我已经尝试过几次重新排列代码,但看起来却不太好,除非此步骤正常,否则一切正常


from bs4 import BeautifulSoup
import time
from time import sleep
from datetime import datetime
import requests
import csv

print(" Initializing ...")
print(" Loading Keywords")
with open("pcodes.txt") as pcodes:
    postkeys = []
    for line in pcodes:
        postkeys.append(line.strip())

with open("pcodnum.txt") as pcodnum:
    postkeynum = []
    for line in pcodnum:
        postkeynum.append(line.strip())

print(" Welcome to YellScrape v1.0")
print(" You ar searching yell.com ")

comtype = input(" Please enter a Company Type (e.g Newsagent, Barber): ")
pagesnum = 0
listinnum = 0
comloc = " "
f = csv.writer(open(datetime.today().strftime('%Y-%m-%d') + '-' + comtype + '-' + 'yelldata.csv', 'w'))
f.writerow(['Business Name', 'Business Type', 'Phone Number', 'Street Address', 'Locality', 'Region', 'Website'])

headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    }

data_list = []
for x in postkeys:
    print(" Searching " + x + " for " + comtype + " companies")
    for y in postkeynum:
        url = 'https://www.yell.com/ucs/UcsSearchAction.do?keywords=' + comtype + '&pageNum=' + str(y) + '&location=' + x
        data_list.append(url)
        for item in data_list:
            site = requests.get(item, headers=headers)
            soup = BeautifulSoup(site.content, 'html.parser')
            questions = soup.select('.businessCapsule--mainContent')
            for question in questions:
                listinnum += 1
                busname = question.find(class_='businessCapsule--name').get_text()
                bustype =   question.find(class_='businessCapsule--classification').get_text()
                busnumber = question.select_one('span.business--telephoneNumber')
                if busnumber is None:
                    busnumber = 'None'
                else:
                    busnumber = busnumber.text
                busadd = question.find('span', attrs={"itemprop": "streetAddress"})
                if busadd is None:
                    busadd = 'None'
                else:
                    busadd = busadd.text.replace(',',' ')
                buslocal = question.find('span', attrs={"itemprop": "addressLocality"})
                if buslocal is None:
                    buslocal = 'None'
                else:
                    buslocal = buslocal.text
                buspost = question.find('span', attrs={"itemprop": "postalCode"})
                if buspost is None:
                    buspost = 'None'
                else:
                    buspost = buspost.text
                busweb = question.find('a', attrs={"rel": "nofollow noopener"})
                if busweb is None:
                    busweb = 'None'
                else:
                    busweb = busweb.attrs['href']
                print(busweb)
                f.writerow([busname, bustype, busnumber, busadd, buslocal, buspost, busweb])


        pagesnum += 1
        print(" Finsihed Page " + str(y) + ". For " + x + " . " + str(listinnum) + " listings so far. Moving To Next Page")
    print(" Waiting 30 seconds for security reasons.")
    sleep(30)
print(" Finished. \n Total: " + str(pagesnum) + " pages with " + str(listinnum) + " listings. \n Please look for file: " + datetime.today().strftime('%Y-%m-%d') + '-' + comtype + '-' + 'yelldata.csv')

预期结果:

第1页完成 完成第2页 完成第3页

2 个答案:

答案 0 :(得分:1)

这是因为您要添加到数据列表,然后在每次添加新链接后使用for循环对其进行迭代。

因此,它将在第1页上进行requests,然后在第1页上进行requests,在第2页上进行requests,然后在第1、2和3页上进行,然后在第1、2页上进行。 3和4 ...等等

因此,有两种方法可以解决此问题。 1)不要附加到data_list并将其全部消除,或者2)您可以先附加到data_list,然后循环遍历(因此请分开附加到data_list的循环并遍历{{ 1}}。

我选择选项2)

data_list

答案 1 :(得分:0)

为循环初始化pageNum:

for x in postkeys:
   pageNum = 1

递增页面循环和格式网址的数字端

for item in data_list:
    #format website url
    url = "https://www.yell.com/ucs/UcsSearchAction.do?keywords={}&pageNum={}&location={}".format(comtype, pageNum, x)
    site = requests.get(url, headers=headers)

    # check response status code:
    if site.status_code != 200:
        break

    pageNum += 1

您应该删除此for循环:

for y in postkeynum:
        url = 'https://www.yell.com/ucs/UcsSearchAction.do?keywords=' + comtype + '&pageNum=' + str(y) + '&location=' + x
        data_list.append(url)