A Python script to crawl a web page and find the IP addresses of the URLs it contains

Date: 2016-08-04 22:51:31

Tags: python scripting web-crawler

I have started writing the script, as shown below:

import urllib2
from bs4 import BeautifulSoup

trg_url = 'http://timesofindia.indiatimes.com/'
req = urllib2.Request(trg_url)
handle = urllib2.urlopen(req)
page_content = handle.read()
# "html" is not a valid parser name; use "html.parser" (or "lxml" if installed)
soup = BeautifulSoup(page_content, "html.parser")
new_list = soup.find_all('a')

for link in new_list:
    print link.get('href')

But now I am stuck, because I get the output shown below:

http://mytimes.indiatimes.com/?channel=toi
https://www.facebook.com/TimesofIndia
https://twitter.com/timesofindia
https://plus.google.com/117150671992820587865?prsrc=3
http://timesofindia.indiatimes.com/rss.cms
https://www.youtube.com/user/TimesOfIndiaChannel
javascript:void(0);
http://timesofindia.indiatimes.com
javascript://
http://beautypageants.indiatimes.com/
http://photogallery.indiatimes.com/
http://timesofindia.indiatimes.com/videos/entertainment/videolist/3812908.cms
javascript://
/life/fashion/articlelistls/2886715.cms
/life-style/relationship/specials/lsspeciallist/6247311.cms
/debatelist/3133631.cms

Please guide me on how to extract the distinct URLs present in the page along with their IP addresses.
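Note that many of the hrefs in that output are relative paths (`/debatelist/3133631.cms`) or `javascript:` pseudo-links, which have no hostname to resolve. Before any DNS lookup they need to be resolved against the page's base URL and filtered. A minimal sketch of that step (Python 3's `urllib.parse`; the `normalize` helper name is my own, not from the question):

```python
from urllib.parse import urljoin, urlparse

BASE = 'http://timesofindia.indiatimes.com/'

def normalize(href, base=BASE):
    """Resolve a relative href against the base URL; drop non-HTTP links."""
    if href is None:
        return None
    absolute = urljoin(base, href)           # relative paths become absolute URLs
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None                          # skips javascript:void(0); and similar
    return absolute

# Sample hrefs taken from the output above
for h in ['javascript:void(0);',
          '/life/fashion/articlelistls/2886715.cms',
          'https://twitter.com/timesofindia']:
    print(normalize(h))
```

Only the URLs that survive this normalization are candidates for a hostname lookup.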

1 answer:

Answer 0 (score: -1)

Use the socket module to get the IP addresses:

import urllib2
from bs4 import BeautifulSoup
import socket
import csv

trg_url = 'http://timesofindia.indiatimes.com/'
req = urllib2.Request(trg_url)
handle = urllib2.urlopen(req)
page_content = handle.read()
soup = BeautifulSoup(page_content, "lxml")
new_list = soup.find_all('a')

final_list = []
for link in new_list:
    l = link.get('href')
    try:
        # l.split('/')[2] is the hostname of an absolute URL
        final_list.append([l, socket.gethostbyname(l.split('/')[2])])
    except (AttributeError, IndexError, socket.gaierror):
        # missing href, relative/javascript: link, or DNS failure
        final_list.append([l, []])

with open('output.csv', 'wb') as f:  # 'wb' is correct for the csv module on Python 2
    wr = csv.writer(f)
    for row in final_list:
        wr.writerow(row)
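One weakness of this answer: it calls `socket.gethostbyname` once per link, even though most links on the page share the same handful of hostnames. A sketch of a variant that collects the unique hostnames first and resolves each only once (Python 3; the `hostnames` and `resolve` helper names are my own, not part of the answer):

```python
import socket
from urllib.parse import urljoin, urlparse

BASE = 'http://timesofindia.indiatimes.com/'

def hostnames(hrefs, base=BASE):
    """Return the unique hostnames referenced by a list of hrefs,
    resolving relative links against the base URL first."""
    hosts = set()
    for href in hrefs:
        netloc = urlparse(urljoin(base, href)).netloc
        if netloc:                    # empty for javascript: links etc.
            hosts.add(netloc)
    return sorted(hosts)

def resolve(hosts):
    """Look up each hostname once; unresolvable hosts map to None."""
    result = {}
    for h in hosts:
        try:
            result[h] = socket.gethostbyname(h)
        except socket.gaierror:
            result[h] = None
    return result
```

With the hrefs from the question's output, `hostnames(...)` collapses dozens of links to a few hosts, so `resolve(...)` performs far fewer DNS queries than the per-link loop above.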