Question

I'm new to Python and web scraping. I've written some code using requests and BeautifulSoup. One piece scrapes prices, names, and links. It works fine and looks like this:

from bs4 import BeautifulSoup
import requests

urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1"
source = requests.get(urls).text
soup = BeautifulSoup(source, 'lxml')

for figcaption in soup.find_all('figcaption'):
    price = figcaption.div.text
    name = figcaption.find('a', class_='title').text
    link = figcaption.find('a', class_='title')['href']

    print(price)
    print(name)
    print(link)

The other piece builds the remaining URLs that I need to scrape this information from. When I print() them, it gives the correct URLs:

x = 0
counter = 1

for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    counter += 1
    x += 1
    print(urls)

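However, when I try to combine the two, so that it scrapes a page, changes the URL to the next one, and scrapes again, it just gives me the scraped information from the first page 70 times. Please guide me through this. The whole code is as follows: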

from bs4 import BeautifulSoup
import requests

x = 0
counter = 1
for x in range(0, 70):
    urls = "https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-" + str(counter)
    source = requests.get(urls).text
    soup = BeautifulSoup(source, 'lxml')
    counter += 1
    x += 1
    print(urls)

    for figcaption in soup.find_all('figcaption'):
        price = figcaption.div.text
        name = figcaption.find('a', class_='title').text
        link = figcaption.find('a', class_='title')['href']

        print(price)
        print()
        print(name)
        print()
        print(link)

Answer 1

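Setting x = 0 and then incrementing it by 1 is redundant, since you are already iterating over range(0, 70). I'm also not sure why you are using counter, because you don't need that either: the page number can be taken straight from the loop variable as x + 1.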
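As a minimal sketch of that simplification, the URLs from the question can be built from the loop variable alone. The names BASE_URL and page_urls are made up for illustration and do not come from the question:

```python
# Hypothetical helper names; the URL itself is the one from the question.
BASE_URL = (
    "https://www.meisamatr.com/fa/product/cat/"
    "2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html"
    "#/pagesize-24/order-new/stock-1/page-"
)


def page_urls(n_pages):
    """Return the URLs for pages 1..n_pages without a separate counter."""
    return [BASE_URL + str(page) for page in range(1, n_pages + 1)]


urls = page_urls(70)

# Show only the first and last URL to keep the output short.
print(urls[0])
print(urls[-1])
```

Starting the range at 1 removes the need for both counter and the manual x += 1.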

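However, I don't think the problem is the iteration or the loop; it's the URL itself. If you manually go to the two pages listed below, the content doesn't change: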

https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-1

and

https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html#/pagesize-24/order-new/stock-1/page-2
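One way to see why both links return the same content: they differ only in the part after the #, which is a URL fragment, and fragments are not sent to the server as part of an HTTP request. The sketch below uses the standard library's urllib.parse.urldefrag to split the two URLs; the variable names are chosen for illustration only:

```python
from urllib.parse import urldefrag

base = (
    "https://www.meisamatr.com/fa/product/cat/"
    "2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html"
)
page_1 = base + "#/pagesize-24/order-new/stock-1/page-1"
page_2 = base + "#/pagesize-24/order-new/stock-1/page-2"

# urldefrag returns the URL without its fragment, plus the fragment itself.
url_1, fragment_1 = urldefrag(page_1)
url_2, fragment_2 = urldefrag(page_2)

# The part that reaches the server is identical for both pages.
print(url_1 == url_2)
# Only the client-side fragment carries the page number.
print(fragment_1)
print(fragment_2)
```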

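The part of the URL after # is a fragment, which the browser handles on the client side and which is never sent to the server, so requests receives the same first page every time. Because the site loads its content dynamically, you need to find another way to iterate page by page, or figure out the exact URL the page requests. So, try: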

from bs4 import BeautifulSoup
import requests

for x in range(0, 70):
    try:
        # Request URL found in the browser's network tab; the page number is x + 1
        urls = 'https://www.meisamatr.com/fa/product/cat/2-%D8%A2%D8%B1%D8%A7%DB%8C%D8%B4%DB%8C.html&pagesize[]=24&order[]=new&stock[]=1&page[]=' + str(x + 1) + '&ajax=ok?_=1561559181560'
        source = requests.get(urls).text
        soup = BeautifulSoup(source, 'lxml')

        print('Page: %s' % (x + 1))

        for figcaption in soup.find_all('figcaption'):
            price = figcaption.find('span', {'class': 'new_price'}).text.strip()
            name = figcaption.find('a', class_='title').text
            link = figcaption.find('a', class_='title')['href']

            print('%s\n%s\n%s' % (price, name, link))
    except Exception:
        # Stop on a failed request or when an expected element is missing
        break
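The try/except above stops the loop as soon as anything goes wrong. An alternative, sketched here under assumptions, is to stop explicitly when a page yields no items. The function scrape_pages and the stand-in fake_fetch are hypothetical; in real use the fetch function would wrap the requests and BeautifulSoup calls shown above, and it is an assumption that the site returns no figcaption elements past the last page:

```python
def scrape_pages(fetch_items, max_pages=70):
    """Collect items page by page, stopping at the first empty page."""
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_items(page)
        if not items:
            # No products on this page: assume we ran past the last one.
            break
        results.extend(items)
    return results


# Stand-in for the real network call, with made-up data for two pages.
calls = []
fake_data = {
    1: [("100", "item a", "/a"), ("200", "item b", "/b")],
    2: [("300", "item c", "/c")],
}


def fake_fetch(page):
    calls.append(page)
    return fake_data.get(page, [])


results = scrape_pages(fake_fetch)
for price, name, link in results:
    print(price, name, link)
```

Keeping the stop condition separate from error handling means a genuine network or parsing error is not silently mistaken for the end of the listing.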

You can find that link by going to the site and opening the developer tools (Ctrl + Shift + I, or right-click and choose "Inspect") -> Network -> XHR.

With that open, I physically clicked through to the next page, which let me see how the data is rendered and find the request URL.
