I have a function that looks like this. Given a URL, it looks up the domain on who.is:
import whois
def who_is(url):
    w = whois.whois(url)
    return w.text
It returns the following as a raw string:
Domain name:
amazon.co.uk
Registrant:
Amazon Europe Holding Technologies SCS
Registrant type:
Unknown
Registrant's address:
65 boulevard G-D. Charlotte
Luxembourg City
Luxembourg
LU-1311
Luxembourg
Data validation:
Nominet was able to match the registrant's name and address against a 3rd party data source on 10-Dec-2012
Registrar:
Amazon.com, Inc. t/a Amazon.com, Inc. [Tag = AMAZON-COM]
URL: http://www.amazon.com
Relevant dates:
Registered on: before Aug-1996
Expiry date: 05-Dec-2020
Last updated: 23-Oct-2013
Registration status:
Registered until expiry date.
Name servers:
ns1.p31.dynect.net
ns2.p31.dynect.net
ns3.p31.dynect.net
ns4.p31.dynect.net
pdns1.ultradns.net
pdns2.ultradns.net
pdns3.ultradns.org
pdns4.ultradns.org
pdns5.ultradns.info
pdns6.ultradns.co.uk 204.74.115.1 2610:00a1:1017:0000:0000:0000:0000:0001
WHOIS lookup made at 21:09:42 10-May-2017
--
This WHOIS information is provided for free by Nominet UK the central registry
for .uk domain names. This information and the .uk WHOIS are:
Copyright Nominet UK 1996 - 2017.
You may not access the .uk WHOIS or use any data from it except as permitted
by the terms of use available in full at http://www.nominet.uk/whoisterms,
which includes restrictions on: (A) use of the data for advertising, or its
repackaging, recompilation, redistribution or reuse (B) obscuring, removing
or hiding any or all of this notice and (C) exceeding query rate or volume
limits. The data is provided on an 'as-is' basis and may lag behind the
register. Access may be withdrawn or restricted at any time.
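Just by looking at it I can see that the layout lends itself to being turned into a dictionary, but I'm not sure how to do that in the most efficient way. I need to remove the unwanted text at the bottom and strip all the line breaks and indentation. Doing each of those steps separately isn't efficient. I want to be able to pass any URL to the function and get back a dictionary to work with. Any help would be much appreciated.
The desired output would be: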
dict = {
    'Domain name': 'amazon.co.uk',
    'Registrant': 'Amazon Europe Holding Technologies',
    'Registrant type': 'Unknown',
    # and so on for all the available fields
}
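So far I have tried using the remove function to delete all the \n and \r\n line breaks, and then the replace function to replace all the indentation. However, I'm not at all sure how to remove the large block of text at the bottom.
The python-whois documentation tells you to just print w, but doing so returns the following: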
{
"domain_name": null,
"registrar": null,
"registrar_url": "http://www.amazon.com",
"status": null,
"registrant_name": null,
"creation_date": "before Aug-1996",
"expiration_date": "2020-12-05 00:00:00",
"updated_date": "2013-10-23 00:00:00",
"name_servers": null
}
As you can see, most of the values are null, yet they do have values in what w.text returns.
Answer 0 (score: 1)
Evidently you are using python-whois.
Have a look at the example. You can get all the data in structured form rather than as text that needs parsing:
import whois
w = whois.whois('webscraping.com')
w.expiration_date # dates converted to datetime object
# datetime.datetime(2013, 6, 26, 0, 0)
w.text # the content downloaded from whois server
# u'\nWhois Server Version 2.0\n\nDomain names in the .com and .net ...'
print(w) # print values of all found attributes
# creation_date: 2004-06-26 00:00:00
# domain_name: [u'WEBSCRAPING.COM', u'WEBSCRAPING.COM']
# emails: [u'WEBSCRAPING.COM@domainsbyproxy.com', u'WEBSCRAPING.COM@domainsbyproxy.com']
# expiration_date: 2013-06-26 00:00:00
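You can take all the attributes you need one by one from the whois object (w) and store them in a dict, or simply pass the object itself to whatever function needs the information.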
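As a minimal sketch of the "store them in a dict" option: the helper below is hypothetical (non_empty_fields is not part of python-whois) and assumes the result can be iterated like a mapping, which the dict-style output printed in this thread suggests. To keep the example runnable without a network lookup, it is exercised on a stand-in dict copied from the output in the question.

```python
def non_empty_fields(entry):
    """Return a plain dict holding only the fields that were actually parsed.

    `entry` is assumed to behave like a mapping (hypothetical helper,
    not part of python-whois).
    """
    return {key: value for key, value in entry.items() if value is not None}


# Stand-in for a parsed result, copied from the output shown in the question.
parsed = {
    "domain_name": None,
    "registrar": None,
    "registrar_url": "http://www.amazon.com",
    "creation_date": "before Aug-1996",
    "name_servers": None,
}

print(non_empty_fields(parsed))
```

If that assumption holds, non_empty_fields(whois.whois(url)) would give the same kind of result for a live lookup, and the keys it leaves out are the ones the parser failed to fill in.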
Is there any information in w.text that cannot be accessed as an attribute of w?
It works for me using the same example URL as you.
pip install python-whois
pip freeze |grep python-whois
# python-whois==0.6.5
import whois
w = whois.whois("amazon.co.uk")
w
# {'updated_date': datetime.datetime(2013, 10, 23, 0, 0), 'creation_date': 'before Aug-1996', 'registrar': None, 'registrar_url': 'http://www.amazon.com', 'domain_name': None, 'expiration_date': datetime.datetime(2020, 12, 5, 0, 0), 'name_servers': None, 'status': None, 'registrant_name': None}
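I think I found the problem in the parser.
The regular expression should not be
'Registrant:\n\s*(.*)'
but
'Registrant:\r\n\s*(.*)'
You can try cloning whois locally and modifying it like this (adding \r). If it works, propose the change as a patch, or at least mention it in a bug report.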
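The effect of the missing \r can be checked in isolation. The sketch below assumes, as this answer suspects, that the server response uses \r\n line endings. The sample string and the first_group helper are made up for illustration, and the tolerant pattern uses \r?\n so that it accepts both kinds of line ending, which is a slight variation on the pattern proposed above.

```python
import re

# Hypothetical fragment of w.text with Windows-style line endings.
sample = "Registrant:\r\n        Amazon Europe Holding Technologies SCS\r\n"

# Pattern as quoted from the parser: requires "\n" directly after the colon.
strict = re.compile(r"Registrant:\n\s*(.*)")
# Variation that accepts an optional "\r" before the "\n".
tolerant = re.compile(r"Registrant:\r?\n\s*(.*)")


def first_group(pattern, text):
    """Return the stripped first capture group, or None when nothing matches."""
    match = pattern.search(text)
    # "." also matches "\r", so the captured value has to be stripped.
    return match.group(1).strip() if match else None


print(first_group(strict, sample))
print(first_group(tolerant, sample))
```

With \r\n input the strict pattern finds nothing, which would explain a null registrant_name, while the tolerant one captures the registrant line.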
Answer 1 (score: 0)
Try this:
from collections import OrderedDict
key_value = OrderedDict()  # use dict() if order of keys is not important
for block in textstring.split("\n\n"):  # textstring contains the string of w.text
    try:
        key_value[block.split(":\n")[0].strip()] = '\n'.join(element.strip() for element in block.split(":\n")[1].split('\n'))
    except IndexError:
        pass
# print the result
for key in key_value:
    print(key)
    print(key_value[key])
    print("\n")
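A variation on the same idea, wrapped in a function and handling the two things the question asks about: the footer and the line endings. This is only a sketch under stated assumptions. parse_whois_text is a hypothetical name, the footer is assumed to begin at the "WHOIS lookup made at" line seen in the Nominet output above, and any line ending in a colon is treated as a section header. Line endings are normalised first because splitting on "\n\n" finds no blank lines if the text uses \r\n.

```python
def parse_whois_text(text, footer_marker="WHOIS lookup made at"):
    """Turn 'Header:' lines followed by value lines into a dict of strings.

    Hypothetical helper; the footer marker and the header rule are
    assumptions based on the .uk output shown in the question.
    """
    # Normalise Windows line endings so "\r" never ends up in a key or value.
    text = text.replace("\r\n", "\n")
    # Keep only what precedes the timestamp line and the legal notice.
    body = text.split(footer_marker, 1)[0]

    sections = {}
    current_key = None
    for raw_line in body.split("\n"):
        line = raw_line.strip()
        if not line:
            continue  # blank separator line
        if line.endswith(":"):
            # A line such as "Registrant:" opens a new section.
            current_key = line[:-1]
            sections[current_key] = []
        elif current_key is not None:
            # Anything else is a value belonging to the current section.
            sections[current_key].append(line)
    # Multi-line sections (address, name servers) are joined with newlines.
    return {key: "\n".join(values) for key, values in sections.items()}


sample = (
    "\r\n    Domain name:\r\n        amazon.co.uk\r\n\r\n"
    "    Registrant:\r\n        Amazon Europe Holding Technologies SCS\r\n\r\n"
    "    Relevant dates:\r\n        Registered on: before Aug-1996\r\n"
    "        Expiry date: 05-Dec-2020\r\n\r\n"
    "    WHOIS lookup made at 21:09:42 10-May-2017\r\n\r\n-- \r\n"
    "This WHOIS information is provided for free by Nominet UK the central registry\r\n"
)

for key, value in parse_whois_text(sample).items():
    print(key, "->", repr(value))
```

Lines such as "URL: http://www.amazon.com" stay inside their section's value because they do not end with a colon. Output from other registries is laid out differently, so the marker and the header rule would need adjusting there.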