So I go to http://www.medhelp.org/forums/list, where there are links to many different diseases. Each of those links leads to several pages, and every page contains links that I want to collect.
To gather those links I used this code:
import urllib.request
from bs4 import BeautifulSoup as bs

# Collect the link to each disease forum from the main list page.
myArray = []
html_page = urllib.request.urlopen("http://www.medhelp.org/forums/list")
soup = bs(html_page, "html.parser")
for div in soup.findAll("div", attrs={"class": "forums_link"}):
    myArray.append("http://www.medhelp.org" + div.a["href"])

myArray_for_questions = []

# This loop goes over all links on the main page, i.e. all diseases.
for link in myArray:
    # "link" is the URL of one disease forum.
    html_page = urllib.request.urlopen(link)
    soup1 = bs(html_page, "html.parser")

    # Get the question links on the first page of this forum.
    for div in soup1.findAll("div", attrs={"class": "subject_summary"}):
        myArray_for_questions.append("http://www.medhelp.org" + div.a["href"])

    # Now get the URLs of all the next pages of this forum.
    # myPages is reset per forum so pages collected for earlier
    # forums are not fetched again on later iterations.
    myPages = []
    pages = soup1.findAll("a", href=True, attrs={"class": "page_nav"})
    for l in pages:
        html_page_t = urllib.request.urlopen("http://www.medhelp.org" + l.get("href"))
        soup_t = bs(html_page_t, "html.parser")
        other_pages = soup_t.findAll("a", href=True, attrs={"class": "page_nav"})
        for p in other_pages:
            mystr = "http://www.medhelp.org" + p.get("href")
            if mystr not in myPages:
                myPages.append(mystr)
            if p not in pages:
                pages.append(p)

    # Get all links inside those pages, which are people's questions.
    for page in myPages:
        html_page1 = urllib.request.urlopen(page)
        soup2 = bs(html_page1, "html.parser")
        for div in soup2.findAll("div", attrs={"class": "subject_summary"}):
            myArray_for_questions.append("http://www.medhelp.org" + div.a["href"])
But getting all the links I want from all the pages takes forever. Any ideas?

Thanks
Answer 0 (score: 0):
Try the Scrapy tutorial: work through it, then replace the example site it ships with by the pages you want to scrape. Scrapy downloads pages concurrently and filters duplicate requests for you, which is exactly what the sequential urlopen loop above lacks:
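For instance, here is a minimal sketch of what the tutorial spider might look like when pointed at MedHelp. The CSS selectors are assumptions carried over from the class names used in the question's code (forums_link, subject_summary, page_nav), not verified against the live site:

import scrapy

class MedhelpSpider(scrapy.Spider):
    name = "medhelp"
    start_urls = ["http://www.medhelp.org/forums/list"]

    def parse(self, response):
        # Follow the link to each disease forum on the main list page.
        # (Selector assumed from the div class used in the question.)
        for href in response.css("div.forums_link a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

    def parse_forum(self, response):
        # Emit the question links found on this forum page.
        for href in response.css("div.subject_summary a::attr(href)").getall():
            yield {"question_url": response.urljoin(href)}
        # Follow the pagination links; Scrapy's duplicate-request filter
        # skips pages that were already scheduled, and its downloader
        # fetches many pages concurrently instead of one at a time.
        for href in response.css("a.page_nav::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

Saved as, say, medhelp_spider.py (a hypothetical filename), it can be run with "scrapy runspider medhelp_spider.py -o questions.json", which writes every collected question URL to a JSON file. The concurrent downloading is what removes the "takes forever" problem of the sequential loop.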