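My program works fine when I run the simple IP-address extraction task. In the full web-crawling program, however, it breaks down and produces inconsistent results.
Here is my IP-address code snippet: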
#!/usr/bin/python3
import os
import re

def get_ip_address(url):
    command = "host " + url
    process = os.popen(command)
    results = str(process.read())
    marker = results.find("has address") + 12
    n = (results[marker:].splitlines()[0])
    m = re.search(r'\w+ \w+: \d\([A-Z]+\)', n)
    if m is not None:
        url_new = url[8:]
        command = "host " + url_new
        process = os.popen(command)
        results = str(process.read())
        marker = results.find("has address") + 12
    return results[marker:].splitlines()[0]

print(get_ip_address("https://www.yahoo.com"))
The complete web-scraping program looks like this:
#!/usr/bin/python3
from general import *
from domain_name import *
from ip_address import *
from nmap import *
from robots_txt import *
from whois import *

ROOT_DIR = "companies"
create_dir(ROOT_DIR)

def gather_info(name, url):
    domain_name = get_domain_name(url)
    ip_address = get_ip_address(url)
    nmap = get_nmap('-F', ip_address)
    robots_txt = get_robots_txt(url)
    whois = get_whois(domain_name)
    create_report(name, url, domain_name, nmap, robots_txt, whois, ip_address)

def create_report(name, full_url, domain_name, nmap, robots_txt, whois, ip_address):
    project_dir = ROOT_DIR + '/' + name
    create_dir(project_dir)
    write_file(project_dir + '/full_url.txt', full_url)
    write_file(project_dir + '/domain_name.txt', domain_name)
    write_file(project_dir + '/nmap.txt', nmap)
    write_file(project_dir + '/robots_txt.txt', robots_txt)
    write_file(project_dir + '/whois.txt', whois)
    write_file(project_dir + '/ip_address.txt', ip_address)

x = input("Enter the Company Name: ")
y = input("Enter the complete url of the company: ")
gather_info(x, y)
The input and output look like this:
root@nitin-Lenovo-G580:~/Desktop/web_scanning# python3 main.py
106.10.138.240
Enter the Company Name: Yahoo
Enter the complete url of the company: https://www.yahoo.com/
/bin/sh: 1: Syntax error: "(" unexpected
The output in ip_address.txt is:
hoo.com/ not found: 3(NXDOMAIN)
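As you can see, the program runs and reports the IP as 106.10.138.240, yet ip_address.txt ends up containing something different. I also cannot work out where this /bin/sh syntax error comes from. Please help me...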
Answer 0: (score: 0)
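Sorry, I don't have enough reputation to add a comment, so I'll post my suggestion here.
I think the problem comes from process = os.popen(command) in def get_ip_address(url). You can print command to see whether it is valid.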
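As a rough sketch of that check, the snippet below rebuilds the two command strings that get_ip_address would hand to os.popen, without running them. build_host_command is a hypothetical helper introduced only for this illustration; the two URLs are the one hard-coded in the question's snippet and the one typed at the prompt.

```python
def build_host_command(url):
    """Return the string the question's code passes to os.popen (hypothetical helper)."""
    return "host " + url


for url in ("https://www.yahoo.com", "https://www.yahoo.com/"):
    first_try = build_host_command(url)
    # The question's retry strips the first 8 characters ("https://").
    retry = build_host_command(url[8:])
    # repr() makes stray characters such as a trailing "/" easy to spot.
    print(repr(first_try), "->", repr(retry))
```

For the URL typed at the prompt, the retry still carries the trailing slash, which is not part of a hostname and matches the fragment saved in ip_address.txt.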
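Apart from this problem, a few suggestions:
Try not to use * in imports, because it makes the code harder for readers to trace.
Learn pdb, the Python debugger. It is simple but powerful, and it works well for small and even medium-sized projects. The simplest way to use it is to add import pdb; pdb.set_trace() on the line before the point where you want the program to stop, so that you can step through the code line by line.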
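To illustrate why wildcard imports are hard to trace, here is a small demonstration that uses the standard os module as a stand-in, since the question's own modules are not shown. The scratch dictionary and the shadows_builtin name are assumptions made for this sketch.

```python
import builtins
import os

# Run the wildcard import in a scratch namespace so this script's own
# names stay untouched.
scratch = {}
exec("from os import *", scratch)

# The wildcard silently bound "open" to os.open, hiding the built-in open.
shadows_builtin = scratch["open"] is os.open and scratch["open"] is not builtins.open
print("open shadowed by os.open:", shadows_builtin)
```

Nothing in the import line hints that open has changed meaning. Importing names explicitly, for example from ip_address import get_ip_address, keeps every name traceable to its module.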
Answer 1: (score: 0)
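I second Joe Lin's suggestion not to use wildcards in your import statements. They pollute your namespace considerably and can produce strange behavior.
Python is "batteries included", so you should probably take advantage of the requests and urllib3 packages to handle HTTP requests, use subprocess judiciously to execute commands, and check out the scrapy package for web scraping. The data returned by their respective objects and methods may well contain what you are trying to extract.
Be as lazy as possible and rely on "prior art".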
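In the same spirit, the IP lookup itself can be done without running an external command at all. The sketch below uses the standard library (urllib.parse and socket) rather than the packages named above. get_hostname is a hypothetical helper, and get_ip_address simply reuses the question's function name; treat it as one possible approach, not a drop-in replacement.

```python
import socket
from urllib.parse import urlparse


def get_hostname(url):
    """Return just the hostname part of a URL, without scheme, port, or path."""
    return urlparse(url).hostname


def get_ip_address(url):
    """Resolve the URL's hostname to an IPv4 address string."""
    hostname = get_hostname(url)
    if hostname is None:
        raise ValueError("no hostname found in " + repr(url))
    # gethostbyname asks the system resolver; it raises socket.gaierror on failure.
    return socket.gethostbyname(hostname)
```

Calling get_ip_address("https://www.yahoo.com/") would then resolve www.yahoo.com whether or not the URL ends in a slash. The address returned depends on the resolver and may differ from what host prints.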
In the first few lines of get_ip_address, I noticed the following:
def get_ip_address(url):
    command = "host " + url
    process = os.popen(command)
    ....
If I were to run this command from a shell, it would look exactly like this:
host http://www.foo.com
Run man host and read the man page:
host is a simple utility for performing DNS lookups. It is normally
used to convert names to IP addresses and vice versa. When no arguments
or options are given, host prints a short summary of its command line
arguments and options.
name is the domain name that is to be looked up. It can also be a
dotted-decimal IPv4 address or a colon-delimited IPv6 address, in which
case host will by default perform a reverse lookup for that address.
server is an optional argument which is either the name or IP address
of the name server that host should query instead of the server or
servers listed in /etc/resolv.conf.
You are giving host a URL when it only wants an IP address or a hostname. A URL comprises a scheme, a hostname, and a path. You must explicitly extract the hostname for host to work the way you intend. Since the URL may or may not include path details, you have to take it apart:
url = "http://www.yahoo.com/some_random/path"
# Split on "//" to drop the scheme
_, host_and_path = url.split("//", 1)
# partition() splits at the first "/" and also works when there is no path
hostname, _, path = host_and_path.partition("/")
# Use 'hostname', not the full URL, as input to the command
command = "host " + hostname
...
I don't believe the question includes all the code relevant to this problem. The error output appears to come from the shell rather than being a traditional Python stack trace, so it probably comes from one of the get_something functions that executes a shell command for you.
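One way to rule the shell out is to start programs without it. The sketch below assumes the other helpers build a command string the way get_ip_address does; their source is not shown, so this is a guess. run_command is a hypothetical helper, and a tiny Python program stands in for the real tool so the example runs anywhere.

```python
import subprocess
import sys


def run_command(args):
    """Run a program directly (no shell) and return its standard output."""
    completed = subprocess.run(args, capture_output=True, text=True, check=False)
    return completed.stdout


# Stand-in for a real tool: a tiny Python program that echoes its argument.
echo = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
output = run_command(echo + ["3(NXDOMAIN)"])
print(output.strip())
```

Because the arguments are passed as a list, a value containing "(" reaches the program untouched instead of being parsed by /bin/sh. The equivalent call for the lookup would be run_command(["host", hostname]).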