Extracting links from a website

Date: 2017-07-06 14:57:19

Tags: python

So I want to go to http://www.medhelp.org/forums/list, where there are links to many different disease forums. Each of those links leads to several pages, and each page contains the question links I actually want.

I want to collect those links, so I used this code:

import urllib.request
from bs4 import BeautifulSoup as bs

# collect the forum (disease) links from the main list page
myArray = []
html_page = urllib.request.urlopen("http://www.medhelp.org/forums/list")
soup = bs(html_page, "html.parser")
temp = soup.findAll('div', attrs={'class': 'forums_link'})
for div in temp:
    myArray.append('http://www.medhelp.org' + div.a['href'])

myArray_for_questions = []
myPages = []

# this loop goes over all links on the main page, i.e. all diseases
for link in myArray:

    # "link" is the URL of one forum listed on the main page
    html_page = urllib.request.urlopen(link)
    soup1 = bs(html_page, "html.parser")

    # get the question links on the first page
    temp = soup1.findAll('div', attrs={'class': 'subject_summary'})
    for div in temp:
        myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

    # now get the URLs of all following pages for this forum;
    # appending to "pages" while iterating lets the loop also visit
    # pagination links discovered along the way
    pages = soup1.findAll('a', href=True, attrs={'class': 'page_nav'})
    for l in pages:
        html_page_t = urllib.request.urlopen('http://www.medhelp.org' + l.get('href'))
        soup_t = bs(html_page_t, "html.parser")
        other_pages = soup_t.findAll('a', href=True, attrs={'class': 'page_nav'})
        for p in other_pages:
            mystr = 'http://www.medhelp.org' + p.get('href')
            if mystr not in myPages:
                myPages.append(mystr)
            if p not in pages:
                pages.append(p)

    # get all links inside every collected page, which are people's questions
    # (note: myPages keeps growing across forums, so pages collected for
    # earlier diseases are fetched again on every outer iteration)
    for page in myPages:
        html_page1 = urllib.request.urlopen(page)
        soup2 = bs(html_page1, "html.parser")
        temp = soup2.findAll('div', attrs={'class': 'subject_summary'})
        for div in temp:
            myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

But it takes forever to get all the links I want from all the pages. Any ideas?

Thanks

1 answer:

Answer 0: (score: 0)

Try the Scrapy tutorial; follow it and substitute the pages of this site for the examples it ships with:

https://doc.scrapy.org/en/latest/intro/tutorial.html
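
For reference, a minimal sketch of what such a spider could look like for this site. It assumes the same CSS classes the question's code relies on (forums_link, subject_summary, page_nav); those selectors are taken from the question, not verified against the live markup:

import scrapy

class MedHelpSpider(scrapy.Spider):
    """Crawls the forum list, follows each forum and its pagination,
    and yields the URL of every question it finds."""
    name = "medhelp"
    start_urls = ["http://www.medhelp.org/forums/list"]

    def parse(self, response):
        # follow each forum (disease) link on the main list page
        for href in response.css("div.forums_link a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

    def parse_forum(self, response):
        # collect the question links on this page
        for href in response.css("div.subject_summary a::attr(href)").getall():
            yield {"question_url": response.urljoin(href)}
        # follow the pagination links; Scrapy deduplicates requests
        # automatically, so each page is fetched only once
        for href in response.css("a.page_nav::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

Run it with scrapy runspider medhelp_spider.py -o questions.json. The main speed win over the code in the question is that Scrapy issues requests concurrently and drops duplicate URLs, instead of re-fetching the accumulated page list on every outer iteration.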