Python YouTube page token issue

Time: 2017-12-26 02:18:13

Tags: python web-scraping python-requests

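I'm running into an intermittent problem when I run the code below. I'm trying to collect all the page_tokens from the AJAX calls made by pressing the "Load more" button (if it exists). Basically, I'm trying to get all the page tokens from a YouTube channel.

Sometimes the tokens are retrieved and sometimes they are not. My best guess is that I've made a mistake in the find_embedded_page_token function, or that I need to insert some kind of delay/sleep somewhere.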

Here is the full code:

import requests

def find_XSRF_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('"', pos_begin)
    return html[pos_begin: pos_end]

def find_page_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    return html[pos_begin: pos_end]

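def find_embedded_page_token(html, key, num_chars=2):
    # Take everything between the key and the next '&'.
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    excess_str = html[pos_begin: pos_end]
    # Keep only the part before the first backslash; if there is none,
    # the whole slice (including any trailing HTML) is returned.
    sep = '\\'
    rest = excess_str.split(sep, 1)[0]
    return rest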

sxeVid = 'https://www.youtube.com/user/sxephil/videos'
ajaxStr = 'https://www.youtube.com/browse_ajax?action_continuation=1&continuation='

s = requests.Session()
r = s.get(sxeVid)
html = r.text


session_token = find_XSRF_token(html, 'XSRF_TOKEN', 4)
page_token = find_page_token(html, ';continuation=', 0)
print(page_token)

s = requests.Session()
r = s.get(ajaxStr+page_token)
ajxHtml = r.text
ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)


while page_token:
    ajxBtn = ajxHtml.find('data-uix-load-more-href=')
    if ajxBtn != -1:
        s = requests.Session()
        r = s.get(ajaxStr+ajax_page_token)
        ajxHtml = r.text
        ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
        print(ajax_page_token)
    else:
        break

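This is the unexpected output that comes back at random. It extracts not only the token but also the HTML that should have been cut off after it.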

4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"><span class="yt-uix-button-content">  <span class="load-more-loading hid">
      <span class="yt-spinner">
      <span class="yt-spinner-img  yt-sprite" title="Loading icon"></span>
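
One possible reading of the output above is that the quote closing the token is sometimes backslash-escaped and sometimes bare, so splitting on a backslash only trims the HTML in the first case. The following is a minimal sketch, assuming a token contains only letters, digits, '-', '_' and '%' (which holds for the sample tokens in this question). It matches the token characters directly instead of searching for a terminator. The name extract_continuation_token and both sample strings are hypothetical and made up for illustration:

```python
import re

# Assumption: a token is a run of URL-safe base64 characters and percent
# escapes, which holds for the sample tokens shown in this question.
TOKEN_RE = re.compile(r';continuation=([A-Za-z0-9_%-]+)')


def extract_continuation_token(text):
    """Return the first continuation token found in text, or None."""
    match = TOKEN_RE.search(text)
    return match.group(1) if match else None


# Hypothetical response fragments: one with a backslash-escaped quote after
# the token, one with a bare quote followed by raw HTML.
token = '4qmFsgJAEhhVQ2xG%253D%253D'
escaped = ('href=\\"/browse_ajax?action_continuation=1&amp;continuation='
           + token + '\\"><span')
bare = ('href="/browse_ajax?action_continuation=1&amp;continuation='
        + token + '"><span class="x">')

print(extract_continuation_token(escaped) == token)
print(extract_continuation_token(bare) == token)
print(extract_continuation_token('no token here'))
```

With this approach both fragments should yield the same token, and a response without a token yields None rather than a stray slice of HTML.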

The response I expect is:

4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5iZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk43Z0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9iZ0JBQSUzRCUzRA%253D%253D
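
A loop that collects a list like the one above could be sketched as follows. This is only an illustration under assumptions, not a tested scraper: collect_page_tokens, its fetch, delay and max_pages parameters, and the fake_pages data are all hypothetical, and the token pattern assumes tokens contain only letters, digits, '-', '_' and '%'. The loop stops when no further token is found or when a token repeats, and the optional delay covers the case where pausing between requests turns out to matter:

```python
import re
import time

# Assumption: tokens consist of URL-safe base64 characters and percent escapes.
TOKEN_RE = re.compile(r';continuation=([A-Za-z0-9_%-]+)')


def collect_page_tokens(fetch, first_token, delay=0.0, max_pages=500):
    """Follow continuation tokens until none is found or one repeats.

    fetch is any callable that takes a token and returns the response text.
    """
    tokens = []
    seen = set()
    token = first_token
    while token and token not in seen and len(tokens) < max_pages:
        seen.add(token)
        tokens.append(token)
        if delay:
            time.sleep(delay)  # optional pause between requests
        match = TOKEN_RE.search(fetch(token))
        token = match.group(1) if match else None
    return tokens


# Offline stand-in for the network: maps each token to a fake response.
fake_pages = {
    'AAA': '...&amp;continuation=BBB"><span>',
    'BBB': '...&amp;continuation=CCC\\"><span>',
    'CCC': '<li>last page, no load-more button</li>',
}
result = collect_page_tokens(fake_pages.get, 'AAA')
print(result)
```

Against the real site, fetch could be something like lambda t: session.get(ajaxStr + t).text, using a single requests.Session() for all requests instead of creating a new one each time. Whether the live responses still match this pattern is an assumption that would need checking.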

Any help is greatly appreciated. Also, if my tags are wrong, please let me know which ones to add or remove.

0 answers:

No answers