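Site being scraped: cbc
Background:
Chrome WebDriver
python3
My goal:
Extract the comments from every element with the class vf-comment-thread.
Part of the HTML structure looks like this: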
<div class="vf-commenting vf-comments-widget">
...
<div class="vf-horizontal-list vf3-conversations-list vf3-conversations-list--comments">
<div class="vf-comment-thread"> ... </div>
<div class="vf-comment-thread"> ... </div>
<div class="vf-comment-thread"> ... </div>
...
</div>
...
</div>
Problem: when I use Selenium to locate
"vf-horizontal-list vf3-conversations-list vf3-conversations-list--comments"
and store it in the variable "comm", then print [i.get_attribute("class") for i in comm.find_elements_by_css_selector("*")],
it should give me a list like [..., "vf-comment-thread", "vf-comment-thread", "vf-comment-thread", ...].
However, the list I get is empty.
My exact commands:
wait = WebDriverWait(self.driver, 14)
comm = wait.until(ec.presence_of_element_located((By.CLASS_NAME, "vf-commenting")))
wait = WebDriverWait(comm, 14)
comms = wait.until(ec.presence_of_element_located((By.XPATH, ".//div[@class = 'vf-horizontal-list']")))
print([i.get_attribute("class") for i in comms.find_elements_by_css_selector("*")])
Output: []
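Answer 0 (score: 0)
The problem you are facing is that the comments are generated dynamically by JavaScript, so you need to scroll down first to load them.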
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open the browser
driver = webdriver.Chrome()


def ScrollDown(interval=3.5, looper=20):
    """Scroll to the bottom until the page height stops growing or `looper` passes are done."""
    count = 0
    # Get the current scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while count < looper:
        print('Scrolling down to bottom loop {}/{}'.format(count + 1, looper))
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load
        print('sleeping {} secs'.format(interval))
        sleep(interval)
        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        count += 1


driver.get('https://www.cbc.ca/news/canada/new-brunswick/dieppe-newfoundland-mail-packages-1.5367640')
# Scroll down the page until all the dynamic content is loaded
ScrollDown()

# XPath of the comments list; contains() is used because the class attribute holds several classes.
# find_elements(By.XPATH, ...) is used because newer Selenium 4 releases no longer provide find_elements_by_xpath.
parent_xpath = "//div[contains(@class, 'vf-horizontal-list') and contains(@class, 'conversations-list--comments')]"

# Method 1: get all children using *
all_children = driver.find_elements(By.XPATH, parent_xpath + "/*")
if all_children:
    print([i.get_attribute("class") for i in all_children])

# Method 2: get all children using the children's tag name
comm = driver.find_elements(By.XPATH, parent_xpath + "/div")
if comm:
    print([i.get_attribute("class") for i in comm])

# Method 3: locate the parent by XPath, then loop through its children
parent_tag = driver.find_elements(By.XPATH, parent_xpath)
if parent_tag:
    print([i.get_attribute("class") for i in parent_tag[0].find_elements(By.XPATH, './*')])
    print([i.get_attribute("class") for i in parent_tag[0].find_elements(By.XPATH, '*')])
Output:
['vf-comment-thread', 'vf-comment-thread', 'vf-comment-thread']
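A side note on the locator itself: the XPath in the question, .//div[@class = 'vf-horizontal-list'], compares the whole class attribute with a single name, so it cannot match a div whose class attribute lists three classes. contains(@class, ...) avoids that, but it is a substring test and would also accept a longer class name that merely starts with the same text. The sketch below shows one common way to match whole class tokens instead. The helper names xpath_with_classes and find_comment_threads are hypothetical and not part of Selenium, and the code assumes the comments list sits in the main document once the page has been scrolled.

```python
def xpath_with_classes(tag, *class_names):
    """Build an XPath matching `tag` elements that carry every given class token."""
    if not class_names:
        raise ValueError("at least one class name is required")
    # Padding the normalized attribute with spaces lets each test match a whole
    # token, so 'vf-horizontal-list' does not match 'vf-horizontal-list-extra'.
    tests = [
        "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(name)
        for name in class_names
    ]
    return "//{}[{}]".format(tag, " and ".join(tests))


def find_comment_threads(driver, timeout=14):
    """Wait for the comments list (hypothetical helper), then return its direct children."""
    # Imported lazily so the XPath helper above can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as ec
    from selenium.webdriver.support.ui import WebDriverWait

    parent_xpath = xpath_with_classes(
        "div", "vf-horizontal-list", "vf3-conversations-list--comments"
    )
    # This only waits for the list container; the threads inside it may still
    # need the scrolling step from the answer before they are rendered.
    parent = WebDriverWait(driver, timeout).until(
        ec.presence_of_element_located((By.XPATH, parent_xpath))
    )
    return parent.find_elements(By.XPATH, "./*")


print(xpath_with_classes("div", "vf-horizontal-list"))
```

After ScrollDown() has run, find_comment_threads(driver) could replace any of the three methods above; whether it returns the threads still depends on the page having loaded them.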