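I am learning web scraping and trying to collect arguments from this website. I am using BeautifulSoup and Selenium to do this.
So far I can collect all of the arguments, but not the replies to them. To see the replies, you have to click the red arrow (View Replies). Note that not every argument has replies.
I can think of two solutions:
1. As highlighted in green, I noticed that each argument has a unique ID (aid). I need Selenium to click the red arrow so that the replies are listed. But how do I navigate to View Replies? All I know is that the element carrying the aid and the View Replies element share the same tag name.
2. Use Selenium to click every View Replies link under every argument, then use BeautifulSoup to get the values from the tags. I think the second option is easier. Below is the code I wrote for the second option: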
while True:
    try:
        wait3 = WebDriverWait(driver, 5)
        btn_view_reply = wait3.until(EC.element_to_be_clickable((By.CLASS_NAME, "msg-contain")))
        btn_view_reply.click()
        wait4 = WebDriverWait(driver, 3)
        loadReply = wait4.until(EC.presence_of_element_located((By.CLASS_NAME,"msg-contain")))
        content = driver.execute_script("return document.documentElement.outerHTML;")
    except TimeoutException:
        break
The problem is that Selenium does not move on to the next View Replies button. Could you give me some advice on this? Thank you.
Answer 0 (score: 0)
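Here is a different approach:
- iterate over all posts (the aid attribute uniquely identifies a post);
- if a post has replies, click its View Replies link, wait for the replies to load, and collect them.
Implementation: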
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://www.debate.org/opinions/is-global-climate-change-man-made')
wait = WebDriverWait(driver, 5)

# wait for posts to load
posts_xpath = '//div[@id="debate"]/div/ul/li[@aid]'
wait.until(EC.presence_of_element_located((By.XPATH, posts_xpath)))

# collect posts data
# (find_element(By...) is used instead of find_element_by_*, which newer Selenium releases removed)
posts = []
for post in driver.find_elements(By.XPATH, posts_xpath):
    aid = post.get_attribute('aid')
    contents = post.find_element(By.TAG_NAME, 'p').text
    replies = []

    # check how many replies there are
    reply_count = int(post.find_element(By.CLASS_NAME, 'm-cnt').text)
    if reply_count > 0:
        # expand the replies of this post only, then wait for them to appear
        post.find_element(By.CLASS_NAME, 'msg-contain').click()
        replies_xpath = '//li[@aid="{aid}"]//div[@class="comment-container"]//div[@class="comment"]/div[@class="comment-body"]'.format(aid=aid)
        wait.until(EC.presence_of_element_located((By.XPATH, replies_xpath)))
        for reply in driver.find_elements(By.XPATH, replies_xpath):
            replies.append(reply.text)

    posts.append({'contents': contents, 'replies': replies})

pprint(posts)
This produces the following output:
[{'contents': u'If we have never had been here, then the earth would have gone on healthy and the way it should, but since we are here there are disruptive thing on the earth that are causing the destruction on the earth. And most of these factors are man made, if not all are man made',
'replies': [u"Only 8% of the world's CO2 comes from humans though..."]},
{'contents': u"Yes, I know the temperature changes, but that's natural, it happens another way... now there is a lot of CO2 in the air, the long-wave radiation increases and the heat gets trapped. Greenland is partially melting. Why is it melting now? Well, I guess it kind of melted before, but still, why is it melting more than other times? I have to go with man-made, I need more proof that it is a natural cycle, this time.",
'replies': [u'The trash we burn and we burry is causing that to happen so its just going to melt']},
{'contents': u'The most prominent reason is that most of the energy we depend on is coming from the fossil fuels and its burning produced carbon dioxide, the main cause for global climate change. Another reason is that forests are disappearing because of many purposes for human life. Without strong change in our energy source and use, global climate change will get worse.',
'replies': []},
...
{'contents': u'Solar flares happen on a systematic basis. There are varying degrees of these flares. When these low, medium, and high impact flares happen at the same time an incredibly large amount of bad chemicals are released and cause more impact than human activity. An example is if the low impact flares happened every two years, the medium impact may happen every four years and the high impact may happen every eight years so they would collide quite often.',
'replies': []}]
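You would still need to improve the solution to handle the "Load More Arguments" button at the bottom of the page in order to extract more arguments, but this should give you a good starting point.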
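One possible way to handle that button is to keep clicking it until it stops showing up, and only then run the post-collection loop. The sketch below is not tested against the site: the helper names `click_until_gone` and `load_all_arguments`, the `max_clicks` cap, and above all the `By.LINK_TEXT` locator are assumptions made for illustration, so inspect the page for the real locator before relying on it.

```python
def click_until_gone(find_button, max_clicks=50):
    """Click the element returned by find_button() until it returns None.

    find_button is any zero-argument callable that returns a clickable
    element, or None once the button is no longer available.
    max_clicks is a safety cap so that a button which never disappears
    cannot loop forever. Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        button.click()
        clicks += 1
    return clicks


def load_all_arguments(driver, timeout=5):
    """Hypothetical Selenium wiring for click_until_gone.

    The imports are local so that the helper above stays free of any
    Selenium dependency.
    """
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # ASSUMPTION: the locator below is a guess, not taken from the page.
    load_more = (By.LINK_TEXT, "Load More Arguments")
    wait = WebDriverWait(driver, timeout)

    def find_button():
        try:
            return wait.until(EC.element_to_be_clickable(load_more))
        except TimeoutException:
            return None  # button is gone or never became clickable

    return click_until_gone(find_button)
```

Calling `load_all_arguments(driver)` right after `driver.get(...)` would, under those assumptions, expand the list before the posts are collected. Because the button may stay clickable while new arguments are still loading, it may also be worth waiting for the number of `li[@aid]` elements to grow between clicks.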