Scraping li and id in the same way with BeautifulSoup

Asked: 2016-10-13 21:04:04

Tags: python html web-scraping beautifulsoup

How do I modify the arguments to the findAll method so that it matches both li and id? li is an element and id is an attribute, is that right?
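(Editor's note: yes — in BeautifulSoup a tag name like li is passed positionally, while id is passed as a keyword argument that matches an attribute. A minimal sketch of the two call styles, using an invented HTML snippet:)

```python
from bs4 import BeautifulSoup

# Invented snippet for illustration only
html = '<ul><li class="a">one</li></ul><div id="x">two</div>'
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.findAll('li')   # positional arg matches the tag (element) name
by_id = soup.findAll(id='x')   # keyword args match attributes

print([t.get_text() for t in by_name])
print([t.get_text() for t in by_id])
```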

#Author: David Owens
#File name: soupScraper.py
#Description: html scraper that takes surf reports from various websites

import csv
import requests
from bs4 import BeautifulSoup

###################### SURFLINE URL STRINGS AND TAG ###########################

slRootUrl = 'http://www.surfline.com/surf-report/'
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/'
slScrippsUrl = 'scripps-southern-california_4246/'
slBlacksUrl = 'blacks-southern-california_4245/'
slCardiffUrl = 'cardiff-southern-california_4786/'

slTagText = 'observed-wave-range'
slTag = 'id'

#list of surfline URL endings
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl, slCardiffUrl]

###############################################################################


#################### MAGICSEAWEED URL STRINGS AND TAG #########################

msRootUrl = 'http://magicseaweed.com/'
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/'
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/'
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/'

msTagText = 'rating-text text-dark'
msTag = 'li'

#list of magicseaweed URL endings
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl]

###############################################################################

'''
This method iterates through a list of urls and extracts the surf report from
the webpage dependent upon its tag location

rootUrl: The root url of each surf website
urlList: A list of specific urls to be appended to the root url for each 
     break
tag:     the html tag where the actual report lives on the page

returns: a list of strings of each break's surf report
'''
def extract_Reports(rootUrl, urlList, tag, tagText):
    #empty list to hold reports
    reports = []
    #loop thru URLs
    for url in urlList:
        try:
            #request page
            request = requests.get(rootUrl + url)

            #turn into soup
            soup = BeautifulSoup(request.content, 'lxml')

            #get the tag where report lives
            reportTag = soup.findAll(id = tagText)

            for report in reportTag:
                reports.append(report.string.strip())

        #notify if fail 
        except:
            print 'scrape failure'
            pass

    return reports
#END METHOD

slReports = extract_Reports(slRootUrl, slUrls, slTag, slTagText)
msReports = extract_Reports(msRootUrl, msUrls, msTag, msTagText)

print slReports
print msReports

As of now, only slReports prints correctly, because I explicitly set id = tagText. I also know that my tag parameter is currently unused.

1 Answer:

Answer 0 (score: 0)

So the problem is that you want to search the parse tree, with a single findAll call, for elements that either have the class name rating-text (it turns out you don't need text-dark to identify the relevant elements in the Magicseaweed case) or have the id observed-wave-range.

You can use a filter function to achieve this:

def reportTagFilter(tag):
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
        or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')

Then change the relevant lines of your extract_Reports function to:

        reportTag = soup.findAll(reportTagFilter)[0]
        reports.append(reportTag.text.strip())
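Putting the answer's pieces together, here is a self-contained sketch of the filter-function approach. The sample HTML below is invented for illustration; only the class name rating-text and the id observed-wave-range come from the question. It uses the stdlib html.parser instead of lxml so no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Invented sample page: one Magicseaweed-style rating and one
# Surfline-style observed wave range.
sample_html = """
<ul>
  <li class="rating-text text-dark">Fair</li>
  <li class="other">Ignore me</li>
</ul>
<p id="observed-wave-range">3-4 ft</p>
"""

def reportTagFilter(tag):
    # Match a tag whose class list contains 'rating-text'
    # OR a tag whose id equals 'observed-wave-range'.
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
        or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')

soup = BeautifulSoup(sample_html, 'html.parser')
reports = [t.get_text(strip=True) for t in soup.findAll(reportTagFilter)]
print(reports)
```

Because the filter is an ordinary function, the same extract_Reports code can now be run unchanged against both sites.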