如何使用Beautiful Soup在python中抓取整个整数?

时间:2016-12-15 00:38:32

标签: python parsing web-scraping beautifulsoup

当波浪高度达到两位数范围时,努力从网站获取一些波形高度并且我的代码失败。

例如:目前,代码会将网站中的12分别分别为“1”和“2”,而不是“12”。

#Author: David Owens
#File name: soupScraper.py
#Description: html scraper that takes surf reports from various websites

import csv
import requests
from bs4 import BeautifulSoup

NUM_SITES = 2

reportsFinal = []

###################### SURFLINE URL STRINGS AND TAG ###########################

slRootUrl = 'http://www.surfline.com/surf-report/'
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/'
slScrippsUrl = 'scripps-southern-california_4246/'
slBlacksUrl = 'blacks-southern-california_4245/'
slCardiffUrl = 'cardiff-southern-california_4786/'

slTagText = 'observed-wave-range'
slTag = 'id'

#list of surfline URL endings
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl]

###############################################################################


#################### MAGICSEAWEED URL STRINGS AND TAG #########################

msRootUrl = 'http://magicseaweed.com/'
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/'
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/'
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/'

msTagText = 'rating-text'
msTag = 'li'

#list of magicseaweed URL endings
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl]

###############################################################################

'''
This class represents a surf break. It contains all wave, wind, & tide data 
associated with that break relevant to the website
'''
class surfBreak:
    def __init__(self, name,low, high, wind, tide):
        self.name = name
        self.low = low
        self.high = high
        self.wind = wind
        self.tide = tide    

    #toString method    
    def __str__(self):
        return '{0}: Wave height: {1}-{2} Wind: {3} Tide: {4}'.format(self.name, 
            self.low, self.high, self.wind, self.tide)
#END CLASS

'''
This returns the proper attribute from the surf report sites
'''
def reportTagFilter(tag):
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
        or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')
#END METHOD

'''
This method checks if the parameter is of type int
'''
def representsInt(s):
    try: 
        int(s)
        return True

    except ValueError:
        return False
#END METHOD

'''
This method extracts all ints from a list of reports

reports: The list of surf reports from a single website

returns: reportNums - A list of ints of the wave heights
'''
def extractInts(reports):
    print reports
    reportNums = []
    afterDash = False
    num = 0
    tens = 0
    ones = 0

    #extract all ints from the reports and ditch the rest 
    for report in reports:
        for char in report:
            if representsInt(char) == True:

                num = int(char)                 
                reportNums.append(num)

            else:
                afterDash = True

    return reportNums
#END METHOD

'''
This method iterates through a list of urls and extracts the surf report from
the webpage dependent upon its tag location

rootUrl: The root url of each surf website
urlList: A list of specific urls to be appended to the root url for each 
         break

tag:     the html tag where the actual report lives on the page

returns: a list of strings of each breaks surf report
'''
def extractReports(rootUrl, urlList, tag, tagText):
    #empty list to hold reports
    reports = []
    reportNums = []
    index = 0

    #loop thru URLs
    for url in urlList:
        try:
            index += 1
            #request page
            request = requests.get(rootUrl + url)

            #turn into soup
            soup = BeautifulSoup(request.content, 'lxml')

            #get the tag where surflines report lives
            reportTag = soup.findAll(reportTagFilter)[0]

            reports.append(reportTag.text.strip())      

        #notify if fail 
        except:
            print 'scrape failure at URL ', index
            pass

    reportNums = extractInts(reports)

    return reportNums
#END METHOD

'''
This method calculates the average of the wave heights
'''
def calcAverages(reportList):
    #empty list to hold averages
    finalAverages = []
    listIndex = 0
    waveIndex = 0

    #loop thru list of reports to calc each breaks ave low and high
    for x in range(0, 6):
            #get low ave
            average = (reportList[listIndex][waveIndex] 
                + reportList[listIndex+1][waveIndex]) / NUM_SITES

            finalAverages.append(average)

            waveIndex += 1

    return finalAverages
#END METHOD

slReports = extractReports(slRootUrl, slUrls, slTag, slTagText)
msReports = extractReports(msRootUrl, msUrls, msTag, msTagText)

reportsFinal.append(slReports)
reportsFinal.append(msReports)

print 'Surfline:     ', slReports
print 'Magicseaweed: ', msReports

1 个答案:

答案 0 :(得分:0)

你实际上并没有提取整数,但似乎是浮点数,因为reports中的值类似于['0.3-0.6 m']。现在你只需要浏览每一个字符,然后将它们逐个转换为int或丢弃。所以难怪你只会获得一位数的数字。

从该字符串中提取这些数字的一种(可以说是)简单方法是使用regexp:

import re

FLOATEXPR = re.compile("(\d+\.\d)-(\d+\.\d) {0,1}m")

def extractFloats(reports):
    reportNums = []
    for report in reports:
        groups = re.match(FLOATEXPR, report).groups()
        for group in groups:
            reportNums.append(float(group))
    return reportNums

此表达式将匹配您的浮动并将它们作为列表返回。

详细地说,表达式将匹配在'.'之前至少有一位数的任何内容,以及在它之后的一位数,'-'之间,另一个浮动序列和以'm'结尾的任何数字。或' m'。然后它将表示浮动的部分分组到一个元组。例如,['12.0m-3.0m']将返回[12.0, 3.0]。如果您希望浮点后的数字更多,则可以在表达式中的第二个'+':s后添加额外的'd'