Lists are being overwritten

Posted: 2019-09-03 19:31:29

Tags: python selenium

So, I'm scraping listings from Craigslist, and every time the web driver goes to the next page, my lists of titles, prices, and dates get overwritten. In the end, the only data in my .csv file and MongoDB collection is the listings from the last page.

I've tried moving where the lists are instantiated, but they still get overwritten.

The function that extracts the listing information from a page:

    def extract_post_information(self):
        # Collect every result row on the current page
        all_posts = self.driver.find_elements_by_class_name("result-row")

        dates = []
        titles = []
        prices = []

        for post in all_posts:
            title = post.text.split("$")

            if title[0] == '':
                title = title[1]
            else:
                title = title[0]

            title = title.split("\n")
            price = title[0]

            title = title[-1]
            title = title.split(" ")
            month = title[0]
            day = title[1]
            title = ' '.join(title[2:])
            date = month + " " + day

            if not price[:1].isdigit():
                price = "0"
            price = int(price)

            titles.append(title)
            prices.append(price)
            dates.append(date)

        return titles, prices, dates

This function loads the URL and keeps going to the next page until there is no next page:

    def load_craigslist_url(self):
        self.driver.get(self.url)
        while True:
            try:
                wait = WebDriverWait(self.driver, self.delay)
                wait.until(EC.presence_of_element_located((By.ID, "searchform")))
                print("Page is loaded")
                self.extract_post_information()
                WebDriverWait(self.driver, 2).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
            except:
                print("Last page")
                break

My main:

    import csv
    from itertools import zip_longest

    if __name__ == "__main__":
        filepath = '/home/diego/git_workspace/PyScrape/data.csv'  # Filepath of the written csv file
        location = "philadelphia"  # Location Craigslist searches
        postal_code = "19132"  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
        max_price = "700"  # Max price Craigslist limits the items to
        query = "graphics+card"  # Type of item you are looking for
        radius = "400"  # Radius from postal code Craigslist limits the search to
        # s = 0

        scraper = CraigslistScraper(location, postal_code, max_price, query, radius)

        scraper.load_craigslist_url()

        titles, prices, dates = scraper.extract_post_information()

        d = [titles, prices, dates]

        export_data = zip_longest(*d, fillvalue='')
        with open('data.csv', 'w', encoding="utf8", newline='') as my_file:
            wr = csv.writer(my_file)
            wr.writerow(("Titles", "Prices", "Dates"))
            wr.writerows(export_data)
            # scraper.kill()
        scraper.upload_to_mongodb(filepath)

What I want it to do is grab all the information from one page, go to the next page, grab all of that page's information, and append it to the three lists titles, prices, and dates in the extract_post_information function. Once there are no more next pages, create a list called d from those three lists (as seen in my main).
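For reference, the zip_longest(*d, fillvalue='') call in my main transposes the three column lists into per-row tuples for csv.writerows; a quick sketch with made-up sample values:

    from itertools import zip_longest

    titles = ["GTX 1060", "RX 580"]  # sample values, not real scraped data
    prices = ["200", "150"]
    dates = ["Sep 1", "Sep 2"]

    d = [titles, prices, dates]
    # Transpose the columns into rows, padding shorter lists with ''
    rows = list(zip_longest(*d, fillvalue=''))
    print(rows)  # [('GTX 1060', '200', 'Sep 1'), ('RX 580', '150', 'Sep 2')]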

Should I put the extract_post_information function inside the load_craigslist_url function? Or do I have to change where the three lists are instantiated in the extract_post_information function?

1 answer:

Answer 0 (score: 1)

In the load_craigslist_url() function, you are calling self.extract_post_information() without saving the information it returns.
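A minimal sketch of one way to fix this, assuming the rest of your CraigslistScraper class stays the same and that TimeoutException is imported from selenium.common.exceptions: accumulate the per-page results inside load_craigslist_url() and return the combined lists.

    from selenium.common.exceptions import TimeoutException

    def load_craigslist_url(self):
        self.driver.get(self.url)
        all_titles, all_prices, all_dates = [], [], []
        while True:
            try:
                wait = WebDriverWait(self.driver, self.delay)
                wait.until(EC.presence_of_element_located((By.ID, "searchform")))
                print("Page is loaded")
                # Keep the lists returned for the current page instead of discarding them
                titles, prices, dates = self.extract_post_information()
                all_titles.extend(titles)
                all_prices.extend(prices)
                all_dates.extend(dates)
                WebDriverWait(self.driver, 2).until(
                    EC.element_to_be_clickable(
                        (By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
            except TimeoutException:
                print("Last page")
                break
        return all_titles, all_prices, all_dates

Your main would then read titles, prices, dates = scraper.load_craigslist_url(), and the separate scraper.extract_post_information() call after the loop, which only ever sees the last loaded page, can be dropped.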