So, I'm scraping listings from Craigslist, and every time the web driver goes to the next page, my lists of titles, prices, and dates get overwritten. In the end, the only data in my .csv file and my MongoDB collection are the listings from the last page.
I have tried moving where the lists are instantiated, but they still get overwritten.
The function that extracts the listing information from a page:
def extract_post_information(self):
    all_posts = self.driver.find_elements_by_class_name("result-row")
    dates = []
    titles = []
    prices = []
    for post in all_posts:
        title = post.text.split("$")
        if title[0] == '':
            title = title[1]
        else:
            title = title[0]
        title = title.split("\n")
        price = title[0]
        title = title[-1]
        title = title.split(" ")
        month = title[0]
        day = title[1]
        title = ' '.join(title[2:])
        date = month + " " + day
        if not price[:1].isdigit():
            price = "0"
        int(price)
        titles.append(title)
        prices.append(price)
        dates.append(date)
    return titles, prices, dates
The function that loads the url and keeps going to the next page until there is no next page:
def load_craigslist_url(self):
    self.driver.get(self.url)
    while True:
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is loaded")
            self.extract_post_information()
            WebDriverWait(self.driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
        except:
            print("Last page")
            break
My main:
if __name__ == "__main__":
    filepath = '/home/diego/git_workspace/PyScrape/data.csv'  # Filepath of written csv file
    location = "philadelphia"  # Location Craigslist searches
    postal_code = "19132"  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
    max_price = "700"  # Max price Craigslist limits the items to
    query = "graphics+card"  # Type of item you are looking for
    radius = "400"  # Radius from postal code Craigslist limits the search to
    # s = 0
    scraper = CraigslistScraper(location, postal_code, max_price, query, radius)
    scraper.load_craigslist_url()
    titles, prices, dates = scraper.extract_post_information()
    d = [titles, prices, dates]
    export_data = zip_longest(*d, fillvalue='')
    with open('data.csv', 'w', encoding="utf8", newline='') as my_file:
        wr = csv.writer(my_file)
        wr.writerow(("Titles", "Prices", "Dates"))
        wr.writerows(export_data)
    my_file.close()
    # scraper.kill()
    scraper.upload_to_mongodb(filepath)
What I want it to do is get all the information from one page, go to the next page, get all of that page's information, and append it to the three lists titles, prices, and dates in the extract_post_information function. Once there are no more next pages, build the list called d from those three lists (as seen in my main function).
Should I put the extract_post_information function inside the load_craigslist_url function? Or do I have to adjust where the three lists are instantiated in the extract_post_information function?
Answer (score: 1)
In the load_craigslist_url() function, you are calling self.extract_post_information() without saving the information it returns.
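A minimal sketch of that fix, assuming the rest of the class stays as posted. The accumulator lists all_titles, all_prices, and all_dates are hypothetical names added for illustration, and the bare except is narrowed to Selenium's TimeoutException (imported from selenium.common.exceptions):

from selenium.common.exceptions import TimeoutException

def load_craigslist_url(self):
    self.driver.get(self.url)
    # Accumulators that survive across pages (hypothetical names)
    all_titles, all_prices, all_dates = [], [], []
    while True:
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is loaded")
            # Save the returned lists instead of discarding them
            titles, prices, dates = self.extract_post_information()
            all_titles.extend(titles)
            all_prices.extend(prices)
            all_dates.extend(dates)
            WebDriverWait(self.driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
        except TimeoutException:
            print("Last page")
            break
    return all_titles, all_prices, all_dates

In main you would then write titles, prices, dates = scraper.load_craigslist_url() and drop the second scraper.extract_post_information() call, which only re-scrapes whatever page the driver is left on — the last one, which is exactly the behavior you are seeing.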