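Python 3.6 + Windows 10
When I scrape detail pages such as https://ipinfo.io/AS..., which are linked from https://ipinfo.io/countries/us, the requests module gives me inconsistent results: sometimes the page source is incomplete.
Here are two examples: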
import requests
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}
(1) Requesting https://ipinfo.io/AS13489 (complete page)
complete_result = requests.get('https://ipinfo.io/AS13489', headers=headers)
print(complete_result.text)
The result is the complete HTML page:
<!DOCTYPE html>
<html>
<head>
...
</body>
</html>
(2) Requesting https://ipinfo.io/AS7018 (incomplete)
not_complete_result = requests.get('https://ipinfo.io/AS7018', headers=headers)
print(not_complete_result.text)
The result is only an incomplete HTML page:
</tr>
<tr class="hidden">
...
</body>
</html>
(3) I also tried selenium, which did not work either:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://ipinfo.io/AS7018')
browser.implicitly_wait(5)
print(browser.page_source)
The result is incomplete:
256
</td>
</tr>
<tr class="hidden">
...
</iframe>
</html>
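Update: here is a picture of the data I need. What confuses me is that parts of this data sometimes disappear.
Part of the HTML content is missing:
Update, my code: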
import re
import requests
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}
# s = requests.get('https://ipinfo.io/AS7018', headers=headers).text
# Does not work: s ends up with incomplete HTML content.
s = requests.get('https://ipinfo.io/AS13489', headers=headers).text
asn_code, name = re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>',s).groups()
country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>',s).group("COUNTRY")
registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>',s, re.S).group("REGISTRY").strip()
ip = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>',s, re.S).group("IP").strip()
print(asn_code, name, country, registry, ip)
# AS13489 EPM Telecomunicaciones S.A. E.S.P. Colombia lacnic 3,137,536
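Each re.search call above returns None when its pattern does not match, so chaining .group() or .groups() raises AttributeError on a page whose markup differs. The following is a minimal sketch of a more defensive lookup. The helper name search_group and the sample fragment are hypothetical, made up only to imitate the markup the patterns above expect, and are not taken from ipinfo.io.

```python
import re


def search_group(pattern, text, group, flags=0):
    """Return the named group of the first match, stripped, or None if absent."""
    match = re.search(pattern, text, flags)
    if match is None or match.group(group) is None:
        return None
    return match.group(group).strip()


# Hypothetical fragment imitating the markup the patterns above expect.
sample = '''
<p>Registry</p>
<p class="pb-md-1">
  lacnic
</p>
'''

registry = search_group(
    r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', sample, "REGISTRY", re.S
)
# The sample has no "IP Addresses" section, so this lookup yields None
# instead of raising AttributeError.
ip = search_group(
    r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', sample, "IP", re.S
)
print(registry, ip)
```

A None result then points at the field that could not be found, which makes it easier to tell a pattern mismatch apart from a truncated page.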
Answer 0 (score: 0)
You can get the data you need as follows:
import requests
from lxml import html
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}
not_complete_result = requests.get('https://ipinfo.io/AS7018', headers=headers)
source = html.fromstring(not_complete_result.text)
print(source.xpath('//div[contains(@class, "card-header")]/h3/text()[1]')[0])
# 'AS7018 AT&T Services, Inc.'
for item in source.xpath('(//div[contains(@class, "card-body")])[1]//div[contains(@class, "col-")]/p'):
    print(item.text_content().strip())
# att.com
# United States
# 1996-07-30
# arin
# 80,925,184
# isp
# There are 78,008 domain names hosted across 34,656 IP addresses on this ASN.
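Since the heading near the top of the page can be extracted from the same AS7018 response, the response body itself may be complete, and the cut-off may come from how the console displays a very long print. This is an assumption, not something confirmed in the question. A minimal sketch for checking the body without printing all of it follows; the helper name describe_html is hypothetical.

```python
def describe_html(text):
    """Summarize an HTML body without printing the whole thing."""
    lowered = text.strip().lower()
    return {
        "length": len(text),
        "starts_with_doctype": lowered.startswith("<!doctype html"),
        "ends_with_html_close": lowered.endswith("</html>"),
    }


# Small stand-in strings; with a real response, pass response.text instead.
full_page = "<!DOCTYPE html>\n<html>\n<head></head>\n<body></body>\n</html>\n"
cut_page = '</tr>\n<tr class="hidden">\n</body>\n</html>\n'

print(describe_html(full_page))
print(describe_html(cut_page))
```

With the request above, describe_html(not_complete_result.text) shows whether the text starts with the doctype and ends with the closing html tag. If both are true, the download is intact and only the displayed output is cut off; writing the text to a file opened with encoding="utf-8" is another way to inspect it in full.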