Bs4 and requests proxy script not extracting data

Asked: 2018-03-15 16:14:17

Tags: python proxy beautifulsoup python-requests python-2.x

I'm currently writing a script that should pull some information from supremenewyork.com. The script works when it pulls from the Supreme US site (my local site), but I want it to pull the same information from the UK site, so I added a proxy to the script (I know the proxy works, because I tested it on a smaller, simpler script, and it was able to pull information from the UK site that doesn't exist on the US site). Anyway, here is my script:

import requests
from bs4 import BeautifulSoup
# make sure proxy is http and port 8080
UK_Proxy1 = raw_input('UK http Proxy1: ')
UK_Proxy2 = raw_input('UK http Proxy2: ')

proxies = {
 'http': 'http://' + UK_Proxy1 + '',
   'https': 'http://' + UK_Proxy2 + '',

}

categorys = ['jackets','shirts','tops_sweaters','sweatshirts','pants','shorts','t-shirts','hats','bags','accessories','shoes','skate']
catNumb = 0

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"'+ catStr.upper() + '"*******************\n')
catNumb += 1
for item in bSoup.find_all('div', class_='inner-article'):
    url = item.a['href']
    alt = item.find('img')['alt']
    req = requests.get('http://www.supremenewyork.com' + url)
    item_soup = BeautifulSoup(req.text, 'lxml')
    name = item_soup.find('h1', itemprop='name').text
    style = item_soup.find('p', itemprop='model').text
    print alt +(' --- ')+ name +(' --- ')+ style

When I run this script, all it does is print the category headings, with no information between them. For example: ******JACKETS********, *****SHIRTS*****, etc. I also tested a different version of this script that only pulls information from the accessories category (whose contents differ from the US site):

import requests
from bs4 import BeautifulSoup

UK_Proxy1 = raw_input('UK http Proxy1: ')
UK_Proxy2 = raw_input('UK http Proxy2: ')

proxies = {
    'http': 'http://' + UK_Proxy1 + '',
        'https': 'http://' + UK_Proxy2 + '',

}

r10 = requests.get('http://www.supremenewyork.com/shop/all/accessories', proxies=proxies)

soup10 = BeautifulSoup(r10.text, 'lxml')

for item in soup10.find_all('div', class_='inner-article'):
    url = item.a['href']
    alt = item.find('img')['alt']
    req = requests.get('http://www.supremenewyork.com' + url)
    item_soup = BeautifulSoup(req.text, 'lxml')
    name = item_soup.find('h1', itemprop='name').text
    style = item_soup.find('p', itemprop='model').text
    print alt +(' --- ')+ name +(' --- ')+ style

When I run the script above, it just immediately drops back to the next command prompt >>> in the terminal. Can someone explain why this is happening?
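One way to separate a parsing problem from a proxy problem is to run the same selectors against a saved snippet of the category page, with no network involved. A minimal sketch (Python 3 syntax, and the HTML snippet is a hypothetical stand-in for the real page markup; assumes bs4 is installed):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure the script expects on the
# category page: an inner-article div wrapping a link and an image.
html = """
<div class="inner-article">
  <a href="/shop/jackets/abc123"><img alt="Tiger Stripe Track Jacket"></a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all('div', class_='inner-article')
for item in items:
    # Same attribute lookups as the original script.
    print(item.a['href'], '---', item.find('img')['alt'])
```

If this prints the expected link and alt text but the live script prints nothing, the selectors are fine and the empty output points at the response coming back through the proxy (or at control flow), not at BeautifulSoup.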

1 Answer:

Answer 0 (score: 0)

Testing without the proxy shows that your indentation needs to be fixed:

import requests
from bs4 import BeautifulSoup

categorys = ['jackets','shirts','tops_sweaters','sweatshirts','pants','shorts','t-shirts','hats','bags','accessories','shoes','skate']

for catStr in categorys:
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print '\n*******************"{}"*******************\n'.format(catStr.upper())

    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        style = item_soup.find('p', itemprop='model').text
        print u'{} --- {} --- {}'.format(alt, name, style)

This produces output for each of your categories:

*******************"JACKETS"*******************

Micep6hveho --- Supreme®/UNDERCOVER/Public Enemy Taped Seam Parka --- Multi
Cl x 5vuuu4 --- Supreme®/UNDERCOVER/Public Enemy Puffy Jacket --- Multi
3ez mg0uszy --- Supreme®/UNDERCOVER/Public Enemy Work Jacket --- Dusty Teal
Vywmx4joolc --- Supreme®/UNDERCOVER/Public Enemy Work Jacket --- Black
A8fft p1ixi --- Tiger Stripe Track Jacket --- Brown
Bxvxpc8 dng --- Tiger Stripe Track Jacket --- White

*******************"SHIRTS"*******************

L ktbqa0olk --- Supreme®/UNDERCOVER/Public Enemy Rayon Shirt --- Red
Wl0es6ehmlc --- Supreme®/UNDERCOVER/Public Enemy Rayon Shirt --- Black
Qcpiswrgv0q --- Supreme®/UNDERCOVER/Public Enemy Rayon Shirt --- Green
Cxa3m57wswk --- Denim Shirt --- Blue
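The pitfall behind the empty output can be seen in miniature: a statement dedented out of its loop runs only once, after the loop has finished, and sees only the bindings from the last iteration. A small sketch of this (Python 3 syntax, hypothetical names):

```python
# Statements inside the loop body run once per iteration;
# a dedented statement runs once, after the loop, with the final loop value.
results = []
for cat in ['jackets', 'shirts', 'hats']:
    results.append(cat.upper())   # inside the loop: runs three times
summary = cat                     # outside the loop: runs once, cat == 'hats'

print(results)   # ['JACKETS', 'SHIRTS', 'HATS']
print(summary)   # hats
```

In the original question, the item-processing for loop was dedented out of the category loop, so it ran only after all categories were printed, against whichever page was fetched last.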