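I'm learning web scraping and working with Eat24 (Yelp's site). I can scrape the basic data, but I'm stuck on something very simple: appending that data to a DataFrame. Here is my code; I've commented it, so it should be easy to follow.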
from selenium import webdriver
import time
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
#go to eat24, type in zip code 10007, choose pickup and click search
driver.get("https://new-york.eat24hours.com/restaurants/index.php")
search_area = driver.find_element_by_name("address_auto_complete")
search_area.send_keys("10007")
pickup_element = driver.find_element_by_xpath("//*[@id='search_form']/div/table/tbody/tr/td[2]")
pickup_element.click()
search_button = driver.find_element_by_xpath("//*[@id='search_form']/div/table/tbody/tr/td[3]/button")
search_button.click()
#scroll up and down on page to load more of 'infinity' list
for i in range(0, 3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    driver.execute_script("window.scrollTo(0, 0);")
    time.sleep(1)
#find menu urls
menu_urls = [page.get_attribute('href') for page in
             driver.find_elements_by_xpath('//*[@title="View Menu"]')]
df = pd.DataFrame(columns=['name', 'menuitems'])
#collect menu items/prices/name from each URL
for url in menu_urls:
    driver.get(url)
    menu_items = driver.find_elements_by_class_name("cpa")
    menu_items = [x.text for x in menu_items]
    menu_prices = driver.find_elements_by_class_name('item_price')
    menu_prices = [x.text for x in menu_prices]
    name = driver.find_element_by_id('restaurant_name')
    menuitems = dict(zip(menu_items, menu_prices))
    df['name'] = name
    df['menuitems'] = menuitems
df.to_csv('test.csv', index=False)
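The problem is at the end: it doesn't add menuitems and name to successive rows of the DataFrame. I tried .loc and other functions, but it got messy, so I removed my attempts. Any help would be greatly appreciated!
Edit: when the for loop tries to add the second set of menuitems / restaurant name to the DataFrame, I get the error "ValueError: Length of values does not match length of index".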
Answer 0 (score: 0)
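I figured out a simple solution; not sure why I didn't think of it earlier. I added a "row" counter that increases by 1 on each iteration and used .loc to put the data in row number "row".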
row = 0
for url in menu_urls:
    row += 1
    driver.get(url)
    menu_items = driver.find_elements_by_class_name("cpa")
    menu_items = [x.text for x in menu_items]
    menu_prices = driver.find_elements_by_class_name('item_price')
    menu_prices = [x.text for x in menu_prices]
    # .text extracts the restaurant name as a string instead of storing the WebElement
    name = driver.find_element_by_id('restaurant_name').text
    menuitems = [dict(zip(menu_items, menu_prices))]
    df.loc[row, 'name'] = name
    df.loc[row, 'menuitems'] = menuitems
print(df)
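For anyone who lands here later: a possible alternative to the row counter is to collect one dict per restaurant in a plain list and build the DataFrame once after the loop. This is only a sketch, not what the code above does. The `scraped` list is made-up stand-in data (an assumption, so the snippet runs without Selenium); in the real script those values would come from the `driver` calls inside the loop.

```python
import pandas as pd

# Hypothetical stand-in for the scraped results: (name, items, prices) per restaurant.
# In the real script these would come from the Selenium calls inside the loop.
scraped = [
    ("Cafe A", ["Bagel", "Coffee"], ["$2.50", "$1.75"]),
    ("Diner B", ["Burger"], ["$8.00"]),
]

rows = []  # one dict per restaurant
for name, menu_items, menu_prices in scraped:
    rows.append({
        "name": name,
        "menuitems": dict(zip(menu_items, menu_prices)),
    })

# Build the DataFrame once, after the loop, instead of growing it row by row.
df = pd.DataFrame(rows, columns=["name", "menuitems"])
print(df)
```

Building the frame once avoids the index bookkeeping entirely, and each restaurant ends up on its own row with its item-to-price dict in the menuitems column.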