Web scraper gives random values

Time: 2021-06-11 12:37:13

Tags: python python-3.x web-scraping data-science data-science-experience

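At work, I was tasked with doing a market analysis on bricks. I picked some competitors and made web scrapers to collect their prices. It works for most brick types, but for some brick types it changes the values or says there is no match.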

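The problem is only with Prices_Building. The rest of the code works fine, and the strange thing is that if I use the Prices_Building code to search for just one name, it gets the right result.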

Here is a picture of the output spreadsheet.

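The green parts are the correct values from the website, the red parts are wrong, and the values in {} are the correct ones (where they exist).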

Here is my code:

sheet = client.open("Bricks Compare Prices").get_worksheet(0)



Prices_Amari = []
Prices_Wholesale = []
Prices_Building = []
Names = []
Prices_Amari = []
#List of bricks to compare
lis = [ (list of names boiled down to NAME pack of SIZE

]

Prices_Building = []
Namez = []
for name in lis: # for every name in the list
    target = name.rpartition("Pack")[0] #get the essential name 
    pack_size = re.search(pattern = '[0-9]+', string=name).group() #get the pack size
    res = requests.get("https://eucs13.ksearchnet.com/cloud-search/n-search/search?ticket=klevu-15598202362809967&term={}&paginationStartsFrom=0&sortPrice=false&ipAddress=undefined&analyticsApiKey=klevu-15598202362809967&showOutOfStockProducts=true&klevuFetchPopularTerms=false&klevu_priceInterval=500&fetchMinMaxPrice=true&klevu_multiSelectFilters=true&noOfResults=1&klevuSort=rel&enableFilters=true&layoutVersion=1.0&autoComplete=false&autoCompleteFilters=&filterResults=&visibility=search&category=KLEVU_PRODUCT&klevu_filterLimit=50&sv=2316&lsqt=&responseType=json&klevu_loginCustomerGroup=".format(name))
    results = json.loads(res.text)['result'] #go to this site, search for the brick 

for i in results: #for every result, check that the name and pack size is in the title, or say there's no match
    if target in i['name'] and pack_size in i['name']:
        Prices_Building.append(i['salePrice'])
        Namez.append(i['name'])
    else:
        Prices_Building.append("No match in Building Supplies Online" + name)
        Namez.append(i['name'])

#repeat for the other websites, for each Name in lis:

def get_url_Amaari(search_term):
    build = 'https://ammaaristones.co.uk/?s={}&post_type=product'
    url = build.format(search_term)
    return url
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result_Ammaristones = requests.get(get_url_Amaari(Name), headers=headers)
try:
    soupAmm = BeautifulSoup(result_Ammaristones.text, 'lxml')
    Par = soupAmm.find('div', class_='box-text box-text-products')
    PriceAmm = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?",Par.find('bdi').text)[0]
    Prices_Amari.append(PriceAmm)

except:
    PriceAmm = "no match in Ammari Stones for:" + Name
    Prices_Amari.append(PriceAmm)
    pass

#repeat for the other websites, for each Name in lis:

try:
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    def get_url_Wholesale(search_term):
        build = 'https://brickwholesale.co.uk/?s={}&post_type=product&dgwt_wcas=1'
        url = build.format(search_term)
        return url
    result_Wholesale = requests.get(get_url_Wholesale(Name), headers=headers)


    soupWhole = BeautifulSoup(result_Wholesale.text, 'html.parser')
    Pparent = soupWhole.find_all('span', class_='woocommerce-Price-currencySymbol')
    Whole = (float(re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?",soupWhole.find('bdi').text.strip())[0]))*1.2+96
    PriceWhole = math.floor(Whole)
    if PriceWhole == 96:
        PriceWhole = "No Match in Wholesale Bricks for: " + Name
    
    Prices_Wholesale.append(PriceWhole)
    
except:
    PriceWhole = "no match in wholesale Bricks Stones for:" + Name
    


  

#print to the Google Sheet one row at a time, matching up prices for comparison

for j in range(len(lis)):
    time.sleep(1)
    row =[lis[j],Prices_Amari[j], Prices_Building[j], Prices_Wholesale[j]]
    sheet.append_row(row)

1 Answer:

Answer 0: (score: 0)

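Using a bare except: without printing or reporting the exception is very dangerous, because you are completely hiding every exception that might occur. As an absolute minimum you should print the exception. To do it properly, catch only the specific exceptions that you expect might occur and are willing to suppress, and let all others propagate and halt your code, so that you know something unusual has happened.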
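As a rough illustration of that advice, here is a minimal sketch. It is not the asker's code: parse_price, price_or_placeholder and PRICE_RE are hypothetical names, the regex is a simplified stand-in, and a plain string takes the place of the downloaded page so the example runs without any network access. Which exceptions count as "expected" is also an assumption; here it is IndexError (no number found in the text) and ValueError (the number could not be converted).

```python
import logging
import re

logger = logging.getLogger("scraper")

# Simplified price pattern (an assumption): digits, optional thousands
# separators, optional decimal part.
PRICE_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def parse_price(text):
    """Return the first price found in text as a float.

    Raises IndexError when text contains no number at all.
    """
    return float(PRICE_RE.findall(text)[0].replace(",", ""))


def price_or_placeholder(name, page_text):
    """Hypothetical helper that returns exactly one value per name."""
    try:
        return parse_price(page_text)
    except (IndexError, ValueError) as exc:
        # Expected failure: report it instead of hiding it, then fall back.
        logger.warning("no price found for %r: %r", name, exc)
        return "no match for: " + name
    # Anything else (for example a TypeError because page_text is None)
    # is not caught here, so it propagates and stops the script.


# Stand-in data, invented for the example.
pages = [
    ("Red Brick Pack of 400", "£1,234.50 inc VAT"),
    ("Blue Brick Pack of 500", "Out of stock"),
]
prices = [price_or_placeholder(name, text) for name, text in pages]
```

Because the helper returns a value on every path, the resulting list always has one entry per name. In the question's Wholesale block, by contrast, the except branch sets PriceWhole but never appends it, so a failure there would leave Prices_Wholesale shorter than lis.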