Question

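I wrote the following code to scrape data from a website (for example https://www.oddsportal.com/soccer/new-zealand/football-championship/hamilton-canterbury-GhUEDiE0/). The data in question are the over/under values, which can be found in the page's HTML: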

            <tr class="lo odd">
                <td>
                    <div class="l"><a class="name2" title="Go to Pinnacle website!" onclick="return !window.open(this.href)" href="/bookmaker/pinnacle/link/"><span class="blogos l18"></span></a>&nbsp;<a class="name" title="Go to Pinnacle website!" onclick="return !window.open(this.href)"
                            href="/bookmaker/pinnacle/link/">Pinnacle</a>&nbsp;&nbsp;</div><span class="ico-bookmarker-info ico-bookmaker-detail"><a title="Show more details about Pinnacle" href="/bookmaker/pinnacle/"></a></span></td>
                <td class="center">+0.5</td>
                <td class="right odds">
                    <div class=" deactivateOdd" onmouseout="delayHideTip()" onmouseover="page.hist(this,'P-0.50-0-0','4j5hgx1tkucx1ix0',18,event,0,1)">1.10</div>
                </td>
                <td class="right odds up-dark">
                    <div class=" deactivateOdd" onmouseout="delayHideTip()" onmouseover="page.hist(this,'P-0.50-0-0','4j5hgx1tl1gx1ix0',18,event,0,1)">7.85</div>
                </td>
                <td class="center info-value"><span>-</span></td>
                <td onmouseout="delayHideTip()" class="check ch1" xparam="The match has already started~2"></td>
            </tr>

The interesting part is the over/under values, here 1.10 and 7.85. These values are scraped and collected into a DataFrame:

    master_df= pd.DataFrame()

    for match in self.all_links:
    #for match in links:

        self.openmatch(match)
        self.clickou()
        self.expandodds()   
        for x in range(1,28):
            L = []
            bookmakers=['Asianodds','Pinnacle']

                #odds_type=fi2('//*[@id="odds-data-table"]/div{}/div/strong/a'.format(x))
            if x==1:
                over_under_type= 'Over/Under +0.5'
            elif x==4:
                over_under_type= 'Over/Under +1'
            elif x==6:
                over_under_type= 'Over/Under +1.5'
            elif x==8:
                over_under_type= 'Over/Under +1.75'
            elif x==9:
                over_under_type= 'Over/Under +2'  
            elif x==10:
                over_under_type= 'Over/Under +2.25'
            elif x==11:
                over_under_type= 'Over/Under +2.5'
            elif x==13:
                over_under_type= 'Over/Under +2.75'
            elif x==14:
                over_under_type= 'Over/Under +3' 
            elif x==16:
                over_under_type= 'Over/Under +3.5'  
            elif x==19:
                over_under_type= 'Over/Under +4'
            elif x==21:
                over_under_type= 'Over/Under +4.5'
            elif x==26:
                over_under_type= 'Over/Under +5.5'
            elif x==28:
                over_under_type= 'Over/Under +6.5' 

            for j in range(1,15): # only first 10 bookmakers displayed
                Book = self.ffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[1]/div/a[2]'.format(x,j)) # first bookmaker name
                Odd_1 = self.fffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[3]/div'.format(x,j)) # first home odd
                Odd_2 = self.fffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[4]/div'.format(x,j)) # first away odd
                match = self.ffi('//*[@id="col-content"]/h1') # match teams
                final_score = self.ffi('//*[@id="event-status"]')
                date = self.ffi('//*[@id="col-content"]/p[1]') # Date and time
                print(match, Book, Odd_1, Odd_2, date, final_score, link, over_under_type, '/ 500 ')
                L = L + [(match, Book, Odd_1, Odd_2, date, final_score, link, over_under_type)]
                data_df = pd.DataFrame(L)

                try:
                    data_df.columns = ['TeamsRaw', 'Bookmaker', 'Over', 'Under', 'DateRaw' ,'ScoreRaw','Link','Over Under Type']
                except:
                    print('Function crashed, probable reason : no games scraped (empty season)')
                master_df=pd.concat([master_df,data_df])

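My problem is that, with this code, each iteration takes about 5 minutes to run. I am now trying to make the program faster. I assume there is a more elegant way to do this than all these nested for loops? I need them to get the right "div" for each XPath. I would be grateful for any suggestions!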

Answer 1

I suggest profiling your code to see where the bottleneck is. cProfile is the profiler I usually use.
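
As a rough illustration of that suggestion, here is a minimal sketch of profiling a single iteration with cProfile from the standard library. The function `scrape_one_match` is a hypothetical stand-in for one pass of the loop in the question (it only sleeps to simulate slow lookups), and `profile_call` is a helper name invented for this example; neither appears in the original code.

```python
import cProfile
import io
import pstats
import time


def scrape_one_match(n_lookups=5):
    """Hypothetical stand-in for one iteration of the scraping loop."""
    results = []
    for i in range(n_lookups):
        time.sleep(0.01)  # simulates one slow element lookup
        results.append(i * 2)
    return results


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    # Sort by cumulative time so the most expensive call chains come first.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    result, report = profile_call(scrape_one_match)
    print(report)
```

In the real scraper you would pass the code that handles one match instead of the stand-in. If most of the cumulative time turns out to sit in the element lookup helpers (`ffi` and `fffi` in the question), then the lookups, not the for loops themselves, are the likely cost to reduce.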

How can I make my Python code run faster?
