I have a question about using Python and BeautifulSoup.
My end goal is a program that basically fills out the form on the website and returns the results to me, which I'll eventually write out to an lxml file. I'm pulling the results from https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS, and I'd like to get the list for each city into some kind of Excel document.
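Just to show the kind of output step I have in mind (I haven't written this part yet, so the file name, column names, and sample row below are only placeholders I made up), I was thinking of the csv module, since Excel can open .csv files:

# rough sketch of the output step I'm aiming for - one row per company, saved as a
# .csv file that Excel can open (everything here is placeholder data, not real results)
import csv

rows = [('ALAMEDA ALAMEDA', 'SOME COMPANY', '355', '500')]  # (location, company, premium, deductible)

with open('homeowners_survey.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Location', 'Company', 'Premium', 'Deductible'])
    writer.writerows(rows)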
Here is my code; I put it on pastebin: http://pastebin.com/bZJfMp2N
My results are almost good :D except that for my "correct values", instead of just the number 355, for example, I get it still wrapped in its <div> tag. I want to parse that and display only the number; you'll see what I mean when you run it in Python.
But nothing I try works, and I can't parse that values_2 variable, because the result is a bs4.element.ResultSet when I think what I need to parse is a string. Sorry if I'm a noob; I'm still learning and have been working on this project for a long time.
Does anyone have any input? Anything would be appreciated! I've read that my results are in a list or something and that you can't parse a list? How would I go about doing that?
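From what I've read, each item in a ResultSet is a Tag, so I'm guessing something like the sketch below is what I need (the sample HTML is made up to mimic one row of the survey table, and I believe each Tag has a get_text() method, but I'm not sure this is the right approach):

# standalone sketch of what I think I need: turn a bs4 ResultSet into plain numbers
# (the sample HTML below is made up to look like one row of the survey results)
import re
from bs4 import BeautifulSoup

sample_row = ('<tr valign="top"><td>SOME COMPANY</td>'
              '<td><div align="right">355</div></td>'
              '<td><div align="right">500</div></td></tr>')
soup = BeautifulSoup(sample_row, 'html.parser')

values_2 = soup.find_all('div', {'align': 'right'})       # same kind of ResultSet as in my code
texts = [div.get_text(strip=True) for div in values_2]    # each Tag -> a plain string, e.g. ['355', '500']
numbers = [re.sub(r'\D', '', t) for t in texts]           # keep only the digits, just in case
print(numbers)                                            # prints ['355', '500']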
Here is the code:
__author__ = 'kennytruong'
#THE PROBLEM HERE IS TO PARSE THE RESULTS PROPERLY!!
import urllib.parse, urllib.request
import re
from bs4 import BeautifulSoup
URL = "https://interactive.web.insurance.ca.gov/survey/survey?type=homeownerSurvey&event=HOMEOWNERS"
#Goes through these locations, strips the whitespace in the string and creates a list that starts at every new line
LOCATIONS = '''
ALAMEDA ALAMEDA
'''.strip().split('\n') #strip() basically removes whitespaces
print('Available locations to choose from:', LOCATIONS)
INSURANCE_TYPES = '''
HOMEOWNERS,CONDOMINIUM,MOBILEHOME,RENTERS,EARTHQUAKE - Single Family,EARTHQUAKE - Condominium,EARTHQUAKE - Mobilehome,EARTHQUAKE - Renters
'''.strip().split(',') #strips the whitespaces and starts a newline of the list every comma
print('Available insurance types to choose from:', INSURANCE_TYPES)
COVERAGE_AMOUNTS = '''
15000,25000,35000,50000,75000,100000,150000,200000,250000,300000,400000,500000,750000
'''.strip().split(',')
print('All options for coverage amounts:', COVERAGE_AMOUNTS)
HOME_AGE = '''
New,1-3 Years,4-6 Years,7-15 Years,16-25 Years,26-40 Years,41-70 Years
'''.strip().split(',')
print('All Home Age Options:', HOME_AGE)
def get_premiums(location, coverage_type, coverage_amt, home_age):
    formEntries = {'location': location,
                   'coverageType': coverage_type,
                   'coverageAmount': coverage_amt,
                   'homeAge': home_age}
    inputData = urllib.parse.urlencode(formEntries)
    inputData = inputData.encode('utf-8')
    request = urllib.request.Request(URL, inputData)
    response = urllib.request.urlopen(request)
    responseData = response.read()
    soup = BeautifulSoup(responseData, "html.parser")
    parseResults = soup.find_all('tr', {'valign': 'top'})
    for eachthing in parseResults:
        parse_me = eachthing.text
        name = re.findall(r'[A-z].+', parse_me)  # find me all the words that start with a cap, as many and it doesn't matter what kind.
                                                 # the . for any character and + to signify 1 or more of it.
        values = re.findall(r'\d{1,10}', parse_me)  # find me any digits, however many #'s long as long as btwn 1 and 10
        values_2 = eachthing.find_all('div', {'align': 'right'})
        print('raw code for this part:\n', eachthing, '\n')
        print('here is the name: ', name[0], values)
        print('stuff on sheet 1- company name:', name[0], '- Premium Price:', values[0], '- Deductible', values[1])
        print('but here is the correct values - ', values_2)  # NEEDA STRIP THESE VALUES
        # print(type(values_2)) DOING SO GIVES ME <class 'bs4.element.ResultSet'>, NEEDA PARSE bs4.element type
        # values_3 = re.split(r'\d', values_2)
        # print(values_3) ANYTHING LIKE THIS WILL NOT WORK BECAUSE I BELIEVE RESULTS ARENT STRING
        print('\n\n')

def main():
    for location in LOCATIONS:  # seems to be looping the variable location in LOCATIONS - each location is one area
        print('Here are the options that you selected: ', location, "HOMEOWNERS", "150000", "New", '\n\n')
        get_premiums(location, "HOMEOWNERS", "150000", "New")  # calls function get_premiums and passes parameters

if __name__ == "__main__":  # this basically prevents all the indent level 0 code from getting executed, because otherwise the indent level 0 code gets executed regardless upon opening
    main()