Python: scraping a web page that only shows information after scrolling

Time: 2019-02-03 01:38:56

Tags: python selenium web-scraping beautifulsoup

I am trying to scrape this web page for the arguments in each of the headers.

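What I've tried to do is scroll all the way to the bottom of the page so that all the arguments are revealed (it doesn't take long to reach the bottom), and then extract the HTML from there.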

Here's what I've done. By the way, I got the scrolling code from here.

import time

import urllib3
from bs4 import BeautifulSoup
from selenium import webdriver

SCROLL_PAUSE_TIME = 0.5

#launch url
url = 'https://en.arguman.org/fallacies'

#create chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url)

#get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")


while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

http = urllib3.PoolManager()
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')

claims_h2 = soup('h2')
claims = []
for c in claims_h2:
    claims.append(c.get_text())

for c in claims:
    print(c)

This is what I get: only the arguments you would see without scrolling, before any more are added to the page.

Plants should have the right to vote.
Plants should have the right to vote.
Plants should have the right to vote.
Postmortem organ donation should be opt-out
Jimmy Kimmel should not bring up inaction on gun policy (now)
A monarchy is the best form of government
A monarchy is the best form of government
El lenguaje inclusivo es innecesario
Society suffers the most when dealing with people having mental disorders
Illegally downloading copyrighted music and other files is morally wrong.

If you go through and scroll all the way to the bottom of the page, you'll see these arguments and many others.

Basically, my code doesn't seem to parse the updated HTML.

1 answer:

Answer 0: (score: 3)

It makes no sense to open the site with Selenium, do all the scrolling, and then make the request again with urllib. The two processes are completely separate and unrelated.

Instead, when the scrolling is done, pass driver.page_source to BeautifulSoup and extract the content from there.
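A minimal sketch of that change, assuming the same scrolling loop as in the question. The helper names scroll_to_bottom and scrape_headings are hypothetical, introduced here only for illustration. The line that matters is the one that builds the soup from driver.page_source rather than from a second HTTP request:

```python
import time


def scroll_to_bottom(driver, pause=0.5):
    """Scroll until document.body.scrollHeight stops growing.

    Returns how many scrolls made the page taller (hypothetical helper).
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loads = 0
    while True:
        # Jump to the current bottom, then give the page time to add items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended, so this is the real bottom
        last_height = new_height
        loads += 1
    return loads


def scrape_headings(url):
    """Return the text of every <h2> after scrolling (hypothetical helper)."""
    # Imported here so scroll_to_bottom can be used without these packages;
    # both must be installed, along with a Chrome driver, to call this.
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.implicitly_wait(30)
        driver.get(url)
        scroll_to_bottom(driver)
        # Parse the DOM that Selenium already scrolled, not a fresh response.
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()
    return [h2.get_text() for h2 in soup("h2")]
```

Calling scrape_headings('https://en.arguman.org/fallacies') and printing each returned string should then list the headings that only appear after scrolling, as well as the initial ones.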

Result:

Plants should have the right to vote.
Plants should have the right to vote.
Plants should have the right to vote.
Postmortem organ donation should be opt-out
Jimmy Kimmel should not bring up inaction on gun policy (now)
A monarchy is the best form of government
A monarchy is the best form of government
El lenguaje inclusivo es innecesario
Society suffers the most when dealing with people having mental disorders
Illegally downloading copyrighted music and other files is morally wrong.
Semi-colons are pointless in Javascript
You can't measure how good a programming language is.
You can't measure how good a programming language is.
Semi-colons are pointless in Javascript
Semi-colons are pointless in Javascript
Semi-colons are pointless in Javascript
...