So I wrote a crawler for a friend of mine that goes through a large set of search-result pages, pulls every link out of each page, checks whether the link is already in the output file, and adds it there if it is not. It took a lot of debugging, but it works great! Unfortunately, the little bugger is really picky about which anchor tags it considers important.
Here is the code:
#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin  # urljoin is a function provided by urlparse
import urllib2
import requests  # not necessary, but keeping it here in case of future additions to the code

urls_filename = "myurls.txt"    # this is the input text file, a list of urls or objects to scan
output_filename = "output.txt"  # this is the output file that you will export to Excel
keyword = "skin"                # optional keyword, not used for this script. Ignore

with open(urls_filename, "r") as f:
    url_list = f.read()  # this opens the input text file and reads the information inside it

with open(output_filename, "w") as f:
    for url in url_list.split("\n"):  # this splits the text file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'}  # this (attempts) to tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url)  # tells the program to open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read()  # this assigns the opened page to a variable. like algebra, X = page opened
        soup = BeautifulSoup(page)  # we are feeding the variable to BeautifulSoup so it can analyze it
        urls_all = soup('a')  # BeautifulSoup gathers all the anchor ('a') tags in the page
        for link in urls_all:
            if ('href' in dict(link.attrs)):
                url = urljoin(url, link['href'])  # this combines the relative link, e.g. "/support/contactus.html", with the domain
                if url.find("'") != -1: continue  # explicit statement that the value is not void. if it's NOT void, continue
                url = url.split('#')[0]
                if (url[0:4] == 'http' and url not in output_filename):  # this checks if the item is a webpage and if it's already in the list
                    f.write(url + "\n")  # if it's not in the list, it writes it to the output_filename
It works great except on the following link: https://research.bidmc.harvard.edu/TVO/tvotech.asp
This page has links like "tvotech.asp?Submit=List&ID=796", and the script simply ignores them. The only anchor that ends up in my output file is the main page itself. That is odd, because looking at the source, their anchors are completely standard - they have 'a' and 'href', and I see no reason why bs4 would skip them and include only the main link. Please help. I have already tried removing the http from line 30 or changing it to https, and that just wipes out all the results; not even the main page makes it into the output.
Answer 0 (score: 0)
This is caused by one of the links having a mailto in its href. It gets assigned to the url variable and breaks every link after it, because they no longer pass the url[0:4] == 'http' condition. It looks like this:
mailto:research@bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
You should either filter it out, or stop reusing the same variable url inside the loop; note the change to url1:
for link in urls_all:
    if ('href' in dict(link.attrs)):
        url1 = urljoin(url, link['href'])  # this combines the relative link, e.g. "/support/contactus.html", with the domain
        if url1.find("'") != -1: continue  # explicit statement that the value is not void. if it's NOT void, continue
        url1 = url1.split('#')[0]
        if (url1[0:4] == 'http' and url1 not in output_filename):  # this checks if the item is a webpage and if it's already in the list
            f.write(url1 + "\n")  # if it's not in the list, it writes it to the output_filename
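For the other option the answer mentions, filtering the bad href out before it can clobber anything, a minimal sketch might look like the following. It is not taken from the answer above; it uses urlparse.urlparse to inspect the scheme of each resolved link, and the base URL and href values are just illustrative:

from urlparse import urljoin, urlparse

base = "https://research.bidmc.harvard.edu/TVO/tvotech.asp"
hrefs = ["tvotech.asp?Submit=List&ID=796",
         "mailto:research@bidmc.harvard.edu?subject=Question about TVO",
         "/TVO/contact.asp"]

for href in hrefs:
    absolute = urljoin(base, href)
    # keep only real web links; mailto:, javascript:, tel: and the like are skipped
    if urlparse(absolute).scheme not in ("http", "https"):
        continue
    print absolute

Checking the scheme this way is a little more robust than url[0:4] == 'http', because it rejects mailto:, javascript: and similar links before they can either slip into the output or, as happened here, take over the base URL for every link that follows.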