Why does the same request method return different HTML response structures?

Time: 2019-01-15 09:01:51

Tags: python selenium request python-requests

Python 3.6 + Windows 10

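When I scrape detail pages such as https://ipinfo.io/AS... that are linked from https://ipinfo.io/countries/us, the requests module gives me inconsistent results: sometimes the page source is incomplete.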

Here are two examples:

import requests
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}

(1) Requesting the page https://ipinfo.io/AS13489 (complete page)

complete_result = requests.get('https://ipinfo.io/AS13489', headers=headers)
print(complete_result.text)

The result is the complete HTML page:

<!DOCTYPE html>
<html>
<head>

...    

</body>

</html>

(2) Requesting the page https://ipinfo.io/AS7018 (incomplete)

not_complete_result = requests.get('https://ipinfo.io/AS7018', headers=headers)
print(not_complete_result.text)

The result is only an incomplete HTML page:


 </tr>

    <tr class="hidden">
...     
</body>

</html>
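One way to tell whether the response itself is truncated, or whether only the printed output is cut off (for example by a console scrollback limit, which is an assumption the post does not confirm), is to inspect the string instead of printing all of it. The sketch below uses only the standard library; `describe_html` is a hypothetical helper name, and it would be called on `not_complete_result.text`.

```python
def describe_html(text):
    """Summarize an HTML string without printing all of it."""
    stripped = text.strip()
    lowered = stripped.lower()
    return {
        "length": len(text),
        # A complete page is expected to open with a doctype or an <html> tag.
        "has_start": lowered.startswith("<!doctype") or lowered.startswith("<html"),
        # A complete page is expected to close with </html>.
        "has_end": lowered.endswith("</html>"),
        # Short previews of both ends, safe to print in any console.
        "head": stripped[:60],
        "tail": stripped[-60:],
    }
```

If `has_start` is true for the AS7018 response, the body arrived intact and only the display was cut. In that case writing the text to a file and opening it in an editor should show the whole page.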

(3) I also tried Selenium, and it does not work either:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://ipinfo.io/AS7018')
browser.implicitly_wait(5)

print(browser.page_source)

The result is incomplete as well:

            256

        </td>

    </tr>

    <tr class="hidden">  

...

</iframe>
</html>
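If the server really does return a truncated body on some requests, one option is to retry until the text looks complete. This is a minimal sketch and not something the post itself uses: `looks_complete`, `fetch_complete`, and the `fetch` parameter are hypothetical names, and `fetch` is any callable that returns the page text, for example `lambda url: requests.get(url, headers=headers).text`.

```python
import time


def looks_complete(text):
    """Heuristic: the page opens with a doctype or <html> and closes with </html>."""
    lowered = text.strip().lower()
    return lowered.startswith(("<!doctype", "<html")) and lowered.endswith("</html>")


def fetch_complete(url, fetch, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times and return the first complete page.

    Raises RuntimeError if no attempt yields a complete-looking page.
    """
    for attempt in range(attempts):
        text = fetch(url)
        if looks_complete(text):
            return text
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before the next try
    raise RuntimeError(f"no complete page for {url} after {attempts} attempts")
```

Passing the fetcher in as an argument keeps the retry logic independent of requests or Selenium, so the same check can wrap either one.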

Update: I have attached a screenshot of the data I need. What confuses me now is that these parts of the data sometimes disappear.

[screenshot]

Part of the HTML content is missing:

[screenshot]


Update with my code:


import re
import requests

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}

# s = requests.get('https://ipinfo.io/AS7018', headers=headers).text
# Does not work: s receives incomplete HTML content.

s = requests.get('https://ipinfo.io/AS13489', headers=headers).text

asn_code, name = re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>',s).groups()

country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>',s).group("COUNTRY")

registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>',s, re.S).group("REGISTRY").strip()

ip = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>',s, re.S).group("IP").strip()


print(asn_code, name, country, registry, ip)
# AS13489 EPM Telecomunicaciones S.A. E.S.P. Colombia lacnic 3,137,536
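Each `re.search(...)` call above raises `AttributeError` when its pattern does not match, because `re.search` returns `None` in that case. A small wrapper can make a missing field visible instead of crashing. In this sketch `search_group` is a hypothetical helper name, and the sample fragment is made up for illustration rather than taken from ipinfo.io.

```python
import re


def search_group(pattern, text, group, flags=0):
    """Return the named group of the first match, stripped, or None if no match."""
    match = re.search(pattern, text, flags)
    if match is None:
        return None
    value = match.group(group)
    return value.strip() if value is not None else None


# Made-up fragment that mimics the markup the patterns above expect.
sample = '<p>Registry</p>\n<p class="pb-md-1">\n  lacnic\n</p>'

registry = search_group(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', sample, "REGISTRY", re.S)
missing = search_group(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', sample, "IP", re.S)
print(registry, missing)
```

A `None` result then signals that the field was absent from the HTML that was actually received, which helps separate a parsing problem from a truncated download.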

1 answer:

Answer 0 (score: 0):

You can get the data you need as follows:

import requests
from lxml import html

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}
not_complete_result = requests.get('https://ipinfo.io/AS7018', headers=headers)
source = html.fromstring(not_complete_result.text)

print(source.xpath('//div[contains(@class, "card-header")]/h3/text()[1]')[0])
#  'AS7018 AT&T Services, Inc.'
for item in source.xpath('(//div[contains(@class, "card-body")])[1]//div[contains(@class, "col-")]/p'):
    print(item.text_content().strip())
# att.com
# United States
# 1996-07-30
# arin
# 80,925,184
# isp
# There are 78,008 domain names hosted across 34,656 IP addresses on this ASN.