Getting dynamic HTML content with Python 3

Time: 2018-07-29 12:50:24

Tags: python html python-3.x

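I want to get part of the dynamic HTML content of a website. I can see this content in "Inspect Element" but not in "View Source". I tried the BeautifulSoup and Selenium libraries without success, because after the page has loaded I need to press some on-screen buttons before the content is loaded.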

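For example, on the website http://play.typeracer.com I can load the HTML source, but I cannot load the content (the table and the text) that is displayed after clicking "Practice" on the page.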

I hope I was clear. Thank you for your attention.

1 answer:

Answer 0: (score: 2)

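Here is a solution using Selenium and Firefox:

  1. Open a browser window and navigate to the URL.
  2. Wait until the "Practice" link appears.
  3. Extract all span elements that contain parts of the text.
  4. Build the output string. If the first word has only one letter, there will be only 2 span elements. If the word has more than one letter, there will be 3 span elements.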
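
Step 4 can be tried out without a browser. The following is a minimal sketch, assuming the span texts have already been collected into a list of strings (for example with `[span.text for span in spans]`). The function name `join_span_texts` and the sample strings are hypothetical, and the assumed layout (first letter, rest of the first word, rest of the text) is inferred from the 2-span and 3-span rule above rather than from any documented page structure.

```python
def join_span_texts(parts):
    """Combine the texts of the span elements into one output string.

    parts: list of strings, one per span element, in document order.
    """
    if len(parts) == 2:
        # First word has a single letter: [first word, rest of the text]
        return f'{parts[0]} {parts[1]}'
    if len(parts) == 3:
        # First word has several letters:
        # [first letter, rest of the first word, rest of the text]
        return f'{parts[0]}{parts[1]} {parts[2]}'
    # Layout not covered by the rule above: fall back to a plain join
    return ' '.join(parts)


print(join_span_texts(['S', 'cissors', 'cuts paper.']))  # Scissors cuts paper.
print(join_span_texts(['I', 'like typing.']))            # I like typing.
```

Keeping this logic in a plain function makes it easy to check the joining rule separately from the waiting and clicking.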

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


url = 'http://play.typeracer.com/'
browser = webdriver.Firefox()
browser.get(url)

# wait until the "Practice" link is loaded, then click it
# (until() returns the element, or raises TimeoutException after 30 s)
element = WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.LINK_TEXT, 'Practice')))
element.click()

# wait until the text is loaded, then extract it
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.XPATH, '//span[@unselectable="on"]')))
spans = browser.find_elements(By.XPATH, '//span[@unselectable="on"]')
if len(spans) == 2:  # first word has only one letter
    text = f'{spans[0].text} {spans[1].text}'
elif len(spans) == 3:  # first word has more than one letter
    text = f'{spans[0].text}{spans[1].text} {spans[2].text}'
else:
    text = ' '.join(span.text for span in spans)
    print(f'special case that is not handled yet: {text}')

print(text)
# example output:
# Scissors cuts paper. Paper covers rock. Rock crushes lizard. Lizard poisons Spock. Spock smashes scissors. Scissors decapitates lizard. Lizard eats paper. Paper disproves Spock. Spock vaporizes rock. And as it always has, rock crushes scissors.

The reason for the WebDriverWait blocks is that we have to wait until the content has loaded, which can sometimes take quite a while.

Update

In case you later want to automate the typing as well ;)

# wait until the input field is loaded, then type the text letter by letter
txt_input = WebDriverWait(browser, 30).until(
    EC.presence_of_element_located(
        (By.XPATH, '//input[@class="txtInput" and @autocorrect="off"]')))
for letter in text:
    txt_input.send_keys(letter)
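
The letter-by-letter typing can also be wrapped in a small helper. This is only a sketch: the names `type_text` and `RecordingElement` and the `delay` parameter are hypothetical additions, not part of the answer. The helper relies only on the element having a `send_keys` method, which Selenium's `WebElement` provides, so it can be tried here with a stand-in object instead of a real browser.

```python
import time


def type_text(element, text, delay=0.0):
    """Send text to element one character at a time.

    element: any object with a send_keys(str) method,
             for example a Selenium WebElement.
    delay:   pause in seconds after each keystroke (assumed knob).
    Returns the number of characters sent.
    """
    sent = 0
    for letter in text:
        element.send_keys(letter)
        sent += 1
        if delay > 0:
            time.sleep(delay)
    return sent


class RecordingElement:
    """Stand-in for a WebElement that just records what was typed."""

    def __init__(self):
        self.keys = []

    def send_keys(self, value):
        self.keys.append(value)


fake = RecordingElement()
count = type_text(fake, 'Rock crushes lizard.')
print(count, ''.join(fake.keys))  # 20 Rock crushes lizard.
```

With a real browser session, the call would look like `type_text(txt_input, text, delay=0.05)`, where the delay value is an arbitrary choice.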