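So my code returns HTML tags as well as the symbol “”. How can I remove all HTML tags and the symbol ''? I know that for this symbol I have to do something with Unicode.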
import csv
import requests
from bs4 import BeautifulSoup
from itertools import izip

grant_number = ['0901289','0901282','0901260']
#IMPORTANT NOTE: PLACE GRANT NUMBERS BETWEEN STRINGS WITH NO SPACES
start = 'this site'
end = 'Please report errors'
#start and end show the words that come right before the publication data; This program will scrape for text in between these phrases
my_string = []
#my_string is an empty list for the publication data
for x in grant_number: # fetch the award page for each grant number
    url = "http://nsf.gov/awardsearch/showAward?AWD_ID={}".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    soup_string = str(soup)
    my_string.append(soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)])
with open('NSF.csv', 'wb') as f:
    #Default Filename is NSF.csv ; This can be changed by editing the first field after 'open('
    writer = csv.writer(f)
    writer.writerows(izip(grant_number, my_string))
    #this imports the lists into a csv file with two columns, grant number on left, publication data on right
Answer 0 (score: 0)
If you only want the text (I don't understand why you are doing it this way), you should do this:
soup = BeautifulSoup(r.content, "html.parser")
soup_string = soup.text
If you also want to remove leading and trailing whitespace, do the following:
soup = BeautifulSoup(r.content, "html.parser")
soup_string = soup.text.strip()
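As a minimal sketch of the same idea, the snippet below parses a small hard-coded HTML string (made up for illustration, it is not the NSF page) and uses `get_text()` with its `separator` and `strip` arguments, which also trims the whitespace between individual text fragments rather than only at the two ends. The helper name `html_to_text` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for r.content.
SAMPLE_HTML = "<div>\n  <p> Award <b>0901289</b> </p>\n  <p>Publications</p>\n</div>"


def html_to_text(html):
    """Return the text of `html` with tags removed and whitespace trimmed."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims every text fragment and drops the empty ones;
    # separator is inserted between the fragments that remain.
    return soup.get_text(separator=" ", strip=True)


text = html_to_text(SAMPLE_HTML)
print(text)
```

In the question's loop, the slice between `start` and `end` could then be taken from this text instead of from `str(soup)`, which still contains the markup.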
Answer 1 (score: -1)
If you do this:
soup = BeautifulSoup(r.content, "html.parser")
print soup.get_text()
You should get something like this:
NSF Award Search: Award#0901289 - Rational points on elliptic curves over totally real fields and p-adic L-functions
var printFlag = false;
function printThisPage() {
document.getElementById('printFriendly').style.display='none';
document.getElementById('printFriendly2').style.display='none';
document.getElementById('printFriendly3').style.display='none';
document.getElementById('printFriendly5').style.display='none';
document.getElementById('printFriendly51').style.display='none';
document.getElementById('printFriendly6').style.display='none';
document.getElementById('printFriendly7').style.display='block';
//if (navigator.appName=="Microsoft Internet Explorer"){
// window.print();
//}
//else{
//window.refresh();
window.print();
//}
opener.printFlag = false; } function popwin(url) {
//alert('popwin url = ' + url);
var hNewWnd = window.open(url,"","width=520,height=590,left=480,resizable=yes,status=yes,scrollbars=yes");
if ((document.window != null) && (!hNewWnd.opener))
hNewWnd.opener = document.window; }
function printerFriendlyView()
{
printFlag = true;
var printerFriendlyViewWin = window.open(document.URL, "printerFriendlyViewWin","menubar=1,toolbar=0,scrollbars=1,alwaysRaised=1,width=600,height=600,resizable=1");
}
Research Areas Biological Sciences Computer & Information Science & Engineering Education and Human Resources Engineering Environmental Research & Education Geosciences Office of International & Integrative Activities Mathematical & Physical Sciences Social, Behavioral & Economic Sciences
Learning Resources Film, TV, Exhibits & More! Slideshows & Photo Galleries Classroom Resources Funding for Research on Learning in Formal & Informal Settings
Funding & Awards Funding Info Search Funding Opportunities Browse Funding Opportunities A-Z Recent Funding Opportunities How to Prepare a Funding Proposal Grant Proposal Guide Submit a Proposal to FastLane Award Info Managing Awards Award & Administration Guide Search Awards Award Statistics (Budget Internet Info System)
News & Discoveries Recent News Recent Discoveries Multimedia Gallery Special Reports
Contact Us Staff Directory Organization List Visit NSF Work at NSF Do Business with NSF Press Inspector General Hotline How Do I …?
The National Science Foundation 4201 Wilson Boulevard, Arlington, Virginia 22230, USA Tel: (703) 292-5111 FIRS: (800) 877-8339 TDD: (800) 281-8749
Home Funding
Search Funding Opportunities Browse Opportunities A-Z Recent Opportunities Due Dates Preparing Proposals Policies & Procedures Merit Review Interdisciplinary Research Transformative Research About Funding
Awards
About Awards Managing Awards Policies & Procedures Award Conditions Search Awards Presidential & Honorary Awards Award Statistics (Budget Internet Info System)
Discoveries
Discoveries Home Arctic & Antarctic Astronomy & Space Biology Chemistry & Materials Computing Earth & Environmental Science Education
Engineering Mathematics Nanoscience People & Society Physics Search Discoveries About Discoveries
News
News Home For News Media Multimedia Gallery Special Reports News from the Field Research Overviews Speeches & Lecture NSF Current Newsletter NSF-Wide Investments News Archive Search News
Publications
Publications Home Search Publications Obtaining Publications
Statistics
NCSES Home NCSES Data NCSES Publications NCSES Surveys NCSES Topics Search NCSES About NCSES
About NSF
About NSF History Visit NSF Contact NSF Staff Directory Organization List Career Opportunities Contracting Opportunities NSF & Congress Budget Performance Assessment Info Partners Broadening Participation/Diversity Office of Diversity & Inclusion
Fastlane a {
color: #3c75cf;
text-decoration: none; }
a:hover {
background-color: #c2f96b; }
th {
text-align: left; }
.two_liner li {
margin-left: 20px;
text-indent: -20px;
list-style-type: none; }
.two_liner {
margin: 0px; }
.block_indent {
padding-left: 15px; }
.lineoff {
text-decoration: none; }
.lineoff a {
text-decoration: none;
color: #FF0000; }
.rightcol {
padding: 7px;
font-family: Verdana, Arial, Helvetica, sans-serif;
font-size: x-small; }
.rightimage {
padding-bottom: 4px; }
.rightcol p {
padding-bottom: 4px; }
.rightcol2 {
padding: 7px;
font-family: Verdana, Arial, Helvetica, sans-serif;
font-size: x-small; }
.rightcol2 a {
text-decoration: underline; }
To see the full output, check this paste: http://pastebin.com/TMmc7Yxa
My module versions:
beautifulsoup4==4.4.1
bs4==0.0.1
requests==2.9.1
Operating system: Windows 10 x64
Python version: 2.x
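The output above still contains the page's JavaScript and CSS, because that text lives inside `<script>` and `<style>` elements. One possible way to drop it, sketched here on a made-up HTML string rather than the real NSF page, is to remove those elements with `decompose()` before calling `get_text()`. The helper name `visible_text` is hypothetical; depending on the Beautiful Soup version, `get_text()` may already skip some of this content, but removing the tags explicitly should work either way.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; it only mimics the structure of the real one.
SAMPLE_HTML = """<html><head><title>Award</title>
<script>var printFlag = false;</script>
<style>a { color: #3c75cf; }</style></head>
<body><p>Rational points on elliptic curves</p></body></html>"""


def visible_text(html):
    """Return the text of `html` without script or style contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all();
    # decompose() removes each matched tag and everything inside it.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


text = visible_text(SAMPLE_HTML)
print(text)
```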
Answer 2 (score: -1)
Try importing:
import sys
import requests
from BeautifulSoup import BeautifulSoup
reload(sys)
sys.setdefaultencoding("utf-8")
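Changing the interpreter-wide default encoding with `reload(sys)` is generally discouraged, so an alternative is to keep the text as Unicode, clean it, and encode it explicitly just before writing. The sketch below makes one loud assumption: that the stray symbol comes from non-breaking spaces (U+00A0) in the page, which the question does not confirm. The helper name `clean_text` and the sample string are hypothetical.

```python
import re


def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    # Assumption: the unwanted symbol is caused by U+00A0 (non-breaking space).
    text = text.replace(u"\xa0", u" ")
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", u" ", text).strip()


raw = u"Award\xa0Abstract \n\n #0901289"  # hypothetical scraped text
cleaned = clean_text(raw)
encoded = cleaned.encode("utf-8")  # explicit bytes, ready to be written out
print(cleaned)
```

In Python 2 the `csv` writer expects byte strings, so passing the explicitly encoded value avoids relying on the default encoding.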