Python: BeautifulSoup web parser, how to scrape text without returning HTML tags and "Â"

Asked: 2016-06-06 03:09:02

Tags: python beautifulsoup html-parsing bots

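My code returns HTML tags along with the symbol "Â". How do I remove all the HTML tags and the "Â" symbol? I know that for the symbol I have to do something with Unicode.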

import csv
import requests 
from bs4 import BeautifulSoup
from itertools import izip

grant_number = ['0901289','0901282','0901260']
#IMPORTANT NOTE: PLACE GRANT NUMBERS BETWEEN STRINGS WITH NO SPACES

start = 'this site'
end = 'Please report errors'
#start and end show the words that come right before the publication data; This program will scrape for text in between these phrases
my_string = []
#my_string is an empty list for the publication data


for x in grant_number:      # Number of pages plus one 
    url = "http://nsf.gov/awardsearch/showAward?AWD_ID={}".format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    soup_string = str(soup)
    my_string.append(soup_string[(soup_string.index(start)+len(start)):soup_string.index(end)])
with open('NSF.csv', 'wb') as f:
    #Default Filename is NSF.csv ; This can be changed by editing the first field after 'open('
    writer = csv.writer(f)
    writer.writerows(izip(grant_number, my_string))
#this imports the lists into a csv file with two columns, grant number on left, publication data on right

3 Answers:

Answer 0 (score: 0)

If you only want the text (I don't understand why you are doing it that way), you should do this:

soup = BeautifulSoup(r.content, "html.parser")
soup_string = soup.text
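
As a rough sketch of how this fits the question's approach of slicing between the `start` and `end` phrases: the example below extracts the text first and only then slices it, so no tags end up in the result. The HTML fragment is a made-up stand-in for `r.content`, and `text_between` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content; the real NSF page is not reproduced here.
html = (b"<html><body><p>Listed on this site</p>"
        b"<ul><li>Paper A</li><li>Paper B</li></ul>"
        b"<p>Please report errors</p></body></html>")

start = "this site"
end = "Please report errors"


def text_between(html, start, end):
    """Return the tag-free text found between the start and end markers."""
    # Extract the text first, so the slice below never sees any markup.
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    begin = text.index(start) + len(start)
    return text[begin:text.index(end, begin)].strip()


publication_text = text_between(html, start, end)
print(publication_text)
```

Note that `str.index` raises `ValueError` when a marker is missing, so a page without either phrase would need separate handling.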

If you want to strip the leading and trailing whitespace, do this:

soup = BeautifulSoup(r.content, "html.parser")
soup_string = soup.text.strip()
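
For comparison, `get_text()` also accepts `separator` and `strip` arguments, which trim each piece of text rather than only the two ends of the whole string. A minimal sketch, assuming a made-up HTML fragment instead of the NSF page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
html = "<div><p> First paragraph </p><p>Second paragraph</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Plain extraction keeps the whitespace found inside each element.
raw_text = soup.get_text()

# strip=True trims every text fragment; separator joins the fragments.
clean_text = soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```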

Answer 1 (score: -1)

If you do this:

soup = BeautifulSoup(r.content, "html.parser")
print soup.get_text()

You should get something like this:

NSF Award Search: Award#0901289 - Rational points on elliptic curves over totally real fields and p-adic L-functions


    var printFlag = false;

    function printThisPage() {
    document.getElementById('printFriendly').style.display='none';
    document.getElementById('printFriendly2').style.display='none';
    document.getElementById('printFriendly3').style.display='none';
    document.getElementById('printFriendly5').style.display='none';
    document.getElementById('printFriendly51').style.display='none';
    document.getElementById('printFriendly6').style.display='none';
    document.getElementById('printFriendly7').style.display='block';


    //if (navigator.appName=="Microsoft Internet Explorer"){
    //  window.print();
    //}
    //else{

    //window.refresh();

    window.print();
    //}


    opener.printFlag = false;   }   function popwin(url)   {
     //alert('popwin url = ' + url);
     var hNewWnd = window.open(url,"","width=520,height=590,left=480,resizable=yes,status=yes,scrollbars=yes");
     if ((document.window != null) && (!hNewWnd.opener))

       hNewWnd.opener = document.window;   }

    function printerFriendlyView()
     {  
     printFlag = true;
     var printerFriendlyViewWin = window.open(document.URL, "printerFriendlyViewWin","menubar=1,toolbar=0,scrollbars=1,alwaysRaised=1,width=600,height=600,resizable=1");
     }
    Research Areas   Biological Sciences Computer & Information Science & Engineering Education and Human Resources Engineering Environmental Research & Education Geosciences Office of International & Integrative Activities Mathematical & Physical Sciences Social, Behavioral & Economic Sciences

    Learning Resources   Film, TV, Exhibits & More! Slideshows & Photo Galleries Classroom Resources Funding for Research on Learning in Formal & Informal Settings



    Funding & Awards Funding Info   Search Funding Opportunities Browse Funding Opportunities A-Z Recent Funding Opportunities How to Prepare a Funding Proposal Grant Proposal Guide Submit a Proposal to FastLane   Award Info   Managing Awards Award & Administration Guide Search Awards Award Statistics (Budget Internet Info System)

    News & Discoveries   Recent News Recent Discoveries Multimedia Gallery Special Reports



    Contact Us   Staff Directory Organization List Visit NSF Work at NSF Do Business with NSF Press Inspector General Hotline How Do I …?
    The National Science Foundation 4201 Wilson Boulevard, Arlington, Virginia 22230, USA Tel: (703) 292-5111 FIRS: (800) 877-8339 TDD: (800) 281-8749

  Home Funding
    Search Funding Opportunities Browse Opportunities A-Z Recent Opportunities Due Dates Preparing Proposals Policies & Procedures Merit Review Interdisciplinary Research Transformative Research About Funding

    Awards
    About Awards Managing Awards Policies & Procedures Award Conditions Search Awards Presidential & Honorary Awards Award Statistics (Budget Internet Info System)

    Discoveries
    Discoveries Home Arctic & Antarctic Astronomy & Space Biology Chemistry & Materials Computing Earth & Environmental Science Education
    Engineering Mathematics Nanoscience People & Society Physics Search Discoveries About Discoveries

    News
    News Home For News Media Multimedia Gallery Special Reports News from the Field Research Overviews Speeches & Lecture NSF Current Newsletter NSF-Wide Investments News Archive Search News

    Publications
    Publications Home Search Publications Obtaining Publications

    Statistics
    NCSES Home NCSES Data NCSES Publications NCSES Surveys NCSES Topics Search NCSES About NCSES

    About NSF
    About NSF History Visit NSF Contact NSF Staff Directory Organization List Career Opportunities Contracting Opportunities NSF & Congress Budget Performance Assessment Info Partners Broadening Participation/Diversity Office of Diversity & Inclusion

    Fastlane   a {

    color: #3c75cf;

    text-decoration: none;   }

    a:hover {

    background-color: #c2f96b;   }

    th {

    text-align: left;   }

    .two_liner li {

    margin-left: 20px;

    text-indent: -20px;

    list-style-type: none;   }

    .two_liner {

    margin: 0px;   }

    .block_indent {

    padding-left: 15px;   }

    .lineoff {

    text-decoration: none;   }

    .lineoff a {

    text-decoration: none;

    color: #FF0000;   }

    .rightcol {

    padding: 7px;

    font-family: Verdana, Arial, Helvetica, sans-serif;

    font-size: x-small;   }

    .rightimage {

    padding-bottom: 4px;   }

    .rightcol p {

    padding-bottom: 4px;   }

    .rightcol2 {

    padding: 7px;

    font-family: Verdana, Arial, Helvetica, sans-serif;

    font-size: x-small;   }

    .rightcol2 a {

    text-decoration: underline;   }

To see the full output, check this paste: http://pastebin.com/TMmc7Yxa

My module versions:

beautifulsoup4==4.4.1
bs4==0.0.1
requests==2.9.1

OS: Windows 10 x64

Python version: 2.x
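
The output above still contains the page's JavaScript and CSS, because with the beautifulsoup4 version listed here the contents of `<script>` and `<style>` elements come back as text too (newer releases may already skip them by default). A minimal sketch of removing those elements before extracting the text, assuming a made-up page rather than the real NSF markup:

```python
from bs4 import BeautifulSoup

# Hypothetical page, used only for illustration.
html = """
<html><head>
<style>a { color: #3c75cf; }</style>
<script>var printFlag = false;</script>
</head><body><p>Publication data</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove script and style elements, together with their contents.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)
```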

Answer 2 (score: -1)

Try importing:

import sys
import requests
from BeautifulSoup import BeautifulSoup


reload(sys)
sys.setdefaultencoding("utf-8")
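
For background on where a stray "Â" can come from: one common cause, which is only an assumption here since the page's raw bytes are not shown, is a non-breaking space (U+00A0) that was encoded as UTF-8 and then decoded as Latin-1. The standard-library sketch below, written for Python 3, reproduces that effect and shows that decoding with the matching codec avoids it.

```python
# A non-breaking space (U+00A0) between two words, as it might appear in a page.
nbsp_text = "Award\u00a0Search"

# UTF-8 encodes U+00A0 as the two bytes 0xC2 0xA0.
utf8_bytes = nbsp_text.encode("utf-8")

# Decoding those bytes as Latin-1 turns 0xC2 into a visible "Â" (U+00C2).
wrong = utf8_bytes.decode("latin-1")

# Decoding with the matching codec restores the original text.
right = utf8_bytes.decode("utf-8")

# Optionally replace the non-breaking space with an ordinary space.
cleaned = right.replace("\u00a0", " ")

print(repr(wrong))
print(repr(cleaned))
```

Decoding explicitly like this does not depend on the interpreter-wide default encoding. Python 3 no longer provides `sys.setdefaultencoding`.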