Question

I have a Craigslist aggregator application, and I want to get the URL of every listing on each page.

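Right now, load_craigslist_url() gets me the listing title, price, posting date, and distance from the zip code on every page. extract_post_urls() returns a list of URLs, but I noticed that I only get the listing URLs for one page, not for the subsequent pages.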

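I call extract_post_urls() inside the function that controls moving to the next page. I created a list, links = [], and links.append(self.extract_post_urls()) gives me the list of URLs. When I move to the next page the list does get appended to, but with the first page's URLs rather than the URLs of the page currently being scraped.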

I tried both .append and .extend, but each one only gave me the first page's list of URLs, not the current page's URLs.

Function that controls moving to the next page

def load_craigslist_url(self):

    links = []
    data = []  # List that will hold all of the information scraped
    self.driver.get(self.url)
    while True:
        try:
            wait = WebDriverWait(self.driver, self.delay)
            # Once the driver finds the 'searchform' element, the page is treated as loaded
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is loaded")

            data.append(self.extract_post_information())  # Append the data found from the extract post method into the data list
            links.append(self.extract_post_urls())
            # Wait up to 2 seconds for the 'next page' link to be clickable and click it;
            # if it never becomes clickable, the timeout raises and the except branch ends the loop
            WebDriverWait(self.driver, 2).until(
                EC.element_to_be_clickable(
                    (By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
        except:
            print("Last page")
            break

    return data, links

Scrape the listing data from the current page

Get the URL of every listing on the current page

def extract_post_urls(self):
    url_list = []
    html_page = urllib.request.urlopen(self.url)
    soup = BeautifulSoup(html_page, "lxml")
    for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
        # print(link)
        url_list.append(link["href"])
    return url_list
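
One thing worth noting about the function above: it downloads self.url again with urllib, which is separate from the Selenium driver, so it plausibly sees the first results page no matter which page the driver has moved to. A minimal sketch of one possible alternative, assuming the HTML of the current page is passed in as a string (for example from self.driver.page_source). The function name extract_post_urls_from_html and the example.org pages are hypothetical and only stand in for real results pages:

```python
from bs4 import BeautifulSoup


def extract_post_urls_from_html(html):
    """Return the href of every listing-title link found in an HTML string."""
    # "html.parser" is the parser bundled with Python; "lxml" would also work if installed
    soup = BeautifulSoup(html, "html.parser")
    # class_="hdrlnk" matches any <a> tag whose class list contains "hdrlnk"
    return [a["href"] for a in soup.find_all("a", class_="hdrlnk") if a.has_attr("href")]


# Two hypothetical results pages standing in for driver.page_source on page 1 and page 2
page_1 = '<a class="result-title hdrlnk" href="https://example.org/post/1">A</a>'
page_2 = '<a class="result-title hdrlnk" href="https://example.org/post/2">B</a>'

links = []
for html in (page_1, page_2):
    # Each call parses the HTML it is given, so each page yields its own URLs
    links.append(extract_post_urls_from_html(html))
```

Inside the loop of load_craigslist_url(), the equivalent call would presumably be links.append(extract_post_urls_from_html(self.driver.page_source)), so that the parsed HTML always comes from the page the driver is currently on.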

My list looks like this:

links
                   0: [https://newyork.....]
                   1: [https://newyork.....]

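with as many entries as there are result pages for a given item on Craigslist. The index corresponds to the page that was scraped, so the URLs listed on page 1 are at index 0, those on page 2 are at index 1, and so on.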
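
For illustration, the difference between the two list methods mentioned earlier can be sketched as follows. The URLs are made-up placeholders, not real listings: append keeps one sub-list per page (the index-per-page layout described here), while extend flattens everything into a single list.

```python
# Placeholder URL lists standing in for what each page would return
page_1_urls = ["https://example.org/post/1", "https://example.org/post/2"]
page_2_urls = ["https://example.org/post/3"]

nested = []  # one sub-list per page: nested[0] is page 1, nested[1] is page 2
flat = []    # all URLs in one list, page boundaries are lost

for page_urls in (page_1_urls, page_2_urls):
    nested.append(page_urls)  # adds the whole list as a single element
    flat.extend(page_urls)    # adds each URL as its own element
```

Neither method changes which URLs are collected; they only change how the collected URLs are grouped.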

How can I get the Beautiful Soup function to fetch the URLs from the next page?

0 answers: