Unable to get a link from HTML content using Python

Time: 2016-06-10 15:25:03

Tags: python web-scraping urllib2 urllib

Here is the URL I am using:

http://www.protect-stream.com/PS_DL_xODN4o5HjLuqzEX5fRNuhtobXnvL9SeiyYcPLcqaqqXayD8YaIvg9Qo80hvgj4vCQkY95XB7iqcL4aF1YC8HRg_i_i

On this page, the link I am looking for only appears about 5 seconds after the page has loaded.

After those 5 seconds I see a POST request to http://www.protect-stream.com/secur.php with data like this:

k=2AE_a,LHmb6kSC_c,sZNk4eNixIiPo_c,_c,Gw4ERVdriKuHJlciB1uuy_c,Sr7mOTQVUhVEcMlZeINICKegtzYsseabOlrDb_a,LmiP80NGUvAbK1xhbZGC6OWMtIaNF12f0mYA4O0WxBkmAtz75kpYcrHzxtYt32hCYSp0WjqOQR9bY_a,ofQtw_b,

I cannot figure out where the 'k' value comes from.

Does anyone know how to get the 'k' value using Python?

1 Answer:

Answer 0: (score: 1)

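This is not going to be trivial. The k parameter value is "hidden" inside a script element that sits deep within nested iframes. Here is how to get the k value with requests + BeautifulSoup: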

import re
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

import requests
from bs4 import BeautifulSoup

base_url = "http://www.protect-stream.com"
with requests.Session() as session:
    response = session.get("http://www.protect-stream.com/PS_DL_xODN4o5HjLuqzEX5fRNuhtobXnvL9SeiyYcPLcqaqqXayD8YaIvg9Qo80hvgj4vCQkY95XB7iqcL4aF1YC8HRg_i_i")

    # get the top frame url
    soup = BeautifulSoup(response.content, "html.parser")
    src = soup.select_one('iframe[src^="frame.php"]')["src"]
    frame_url = urljoin(base_url, src)

    # get the nested frame url
    response = session.get(frame_url)
    soup = BeautifulSoup(response.content, "html.parser")
    src = soup.select_one('iframe[src^="w.php"]')["src"]
    frame_url = urljoin(base_url, src)

    # get the frame HTML source and extract the "k" value
    response = session.get(frame_url)
    soup = BeautifulSoup(response.content, "html.parser")
    script = soup.find("script", string=lambda text: text and "k=" in text).get_text(strip=True)

    k_value = re.search(r'var k="(.*?)";', script).group(1)
    print(k_value)

Prints:

YjfH9430zztSYgf7ItQJ4grv2cvH3mT7xGwv32rTy2HiB1uuy_c,Sr7mOTQVUhVEcMlZeINICKegtzYsseabOlrDb_a,LmiP80NGUvAbK1xhbZGC6OWMtIaNF12f0mYA4O0WXhmwUC0ipkPRkLQepYHLyF1U0xvsrzHMcK2XBCeY3_a,O_b,
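
Once k is known, the POST request described in the question could presumably be reproduced. The following is only a minimal sketch using the standard library (Python 3): the helper names `extract_k` and `build_secur_request` and the sample script text are made up for illustration, and it is an assumption that secur.php expects an ordinary form-encoded body with a single `k` field. The request is built but not sent.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

# POST target observed in the question
SECUR_URL = "http://www.protect-stream.com/secur.php"


def extract_k(script_text):
    """Return the value assigned to `var k` in the script text, or None."""
    match = re.search(r'var k="(.*?)";', script_text)
    return match.group(1) if match else None


def build_secur_request(k_value):
    """Build (but do not send) a form-encoded POST carrying the k value."""
    body = urlencode({"k": k_value}).encode("ascii")
    return Request(SECUR_URL, data=body, method="POST")


# Made-up script body for illustration; the real one comes from the w.php frame.
sample_script = 'var a=1; var k="abc_c,DEF_a,"; var b=2;'
k_value = extract_k(sample_script)
request = build_secur_request(k_value)

print(k_value)                 # abc_c,DEF_a,
print(request.get_method())    # POST
print(request.data)            # b'k=abc_c%2CDEF_a%2C'
```

To actually send it, `urllib.request.urlopen(request)` or `session.post(SECUR_URL, data={"k": k_value})` should work in principle. Whether the server also requires the 5 second wait or the cookies from the earlier page loads is not something the source confirms, so reusing the same session and adding a `time.sleep(5)` before posting may be necessary.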