Web scraping with Python BeautifulSoup

Posted: 2016-09-20 13:15:42

Tags: python-2.7 beautifulsoup

I am trying to get information from a web page. I want to open the links inside that page and save the data found there. For this I am using the Python BeautifulSoup package. Please help me open the inner links of the page. My code follows:

    from bs4 import BeautifulSoup
    import urllib

    # Fetch the case page and parse it
    r = urllib.urlopen('http://www-nass.nhtsa.dot.gov/nass/ciren/CaseForm.aspx?xsl=main.xsl&CaseID=317310129').read()
    soup = BeautifulSoup(r, 'html.parser')
    print type(soup)

    # Print the href of every anchor tag on the page
    all_links = soup.find_all("a")
    for link in all_links:
        print link.get("href")
        print "\n"

This prints the hrefs, but how do I open those links and get the information from them? After printing I get output like the following:

SearchForm.aspx?ClearSearch


ListForm.aspx


CaseForm.aspx?ViewText&CaseID=317310129&xsl=textonly.xsl&websrc=true


javascript:parent.myPrintAll(317310129,'10-149','','v 1.0.112008','CIREN','CIREN was established in 1996 and is a multi-center research program involving a collaboration of clinicians and engineers in academia, industry, and government. Level 1 Trauma Centers are linked together through a computer network associated with this research program. Each Center collects detailed crash and medical data on approximately 50 motor vehicle crashes per year. After the necessary coding and quality control takes place, the information is added to a database on the computer network linking the centers. The CIREN database consists of multiple discrete fields of data concerning these crashes, including crash reconstruction and medical injury profiles. Personal and location identifiers and highly sensitive medical information have been removed from the public files to protect patient confidentiality.');


javascript:parent.myPrint(317310129,'10-149','CIREN');


javascript:parent.ShowXML(317310129,'',317310129,true);


javascript:parent.ShowXSD('../XSD/Ciren2013.xsd','',true);
Please help me with this issue.
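
For example, a minimal sketch of following the inner links might look like this (an assumption-laden sketch, not a confirmed solution: it assumes Python 2 per the tags, uses urlparse.urljoin to resolve relative hrefs such as ListForm.aspx against the case page's URL, and skips the javascript: entries, which are browser-side handlers rather than fetchable pages; printing the title is only a placeholder for whatever data you actually want to save):

    from bs4 import BeautifulSoup
    import urllib
    import urlparse

    base_url = 'http://www-nass.nhtsa.dot.gov/nass/ciren/CaseForm.aspx?xsl=main.xsl&CaseID=317310129'
    soup = BeautifulSoup(urllib.urlopen(base_url).read(), 'html.parser')

    for link in soup.find_all("a"):
        href = link.get("href")
        # javascript: hrefs cannot be fetched over HTTP, so skip them
        if not href or href.startswith("javascript:"):
            continue
        # Relative links like ListForm.aspx must be joined with the base URL
        full_url = urlparse.urljoin(base_url, href)
        inner_page = urllib.urlopen(full_url).read()
        inner_soup = BeautifulSoup(inner_page, 'html.parser')
        # Placeholder: print the inner page's title instead of saving its data
        if inner_soup.title:
            print full_url, '->', inner_soup.title.string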

0 Answers:

There are no answers yet.