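Speed up Beautiful Soup with import time (scraping too much extraneous data)

Time: 2016-10-26 01:36:15

Tags: python time beautifulsoup

My program runs correctly and prints the desired output, but it takes over a minute to run. It fetches the data for the entire page and then searches that data for the information it needs. The delay is in fetching the data. Basically, what I'm looking for is a way to fetch only the part of the page that contains the desired information, instead of pulling all the extraneous data from the whole page and slowing the process down drastically.

The information I need is the first data block that contains both "$" and "LDK2-ENY10". The variable z is only there so that I get the first such block and not all the extraneous data after it.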

import requests
from time import time
from bs4 import BeautifulSoup

z = 0  # flag: set to 1 once the first matching block has been printed
link = "http://yugiohprices.com/get_card_prices/Dark+Magician?_={}"
# the current Unix timestamp is filled in as the "_" query parameter
r = requests.get(link.format(int(time())))
soup = BeautifulSoup(r.content, "lxml")
rawdata = soup.find_all("td")
for thing in rawdata:
    if "LDK2-ENY10" in str(thing) and "$" in str(thing) and z == 0:
        print(thing)
        z = 1

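This is the current output. It would probably be fast enough to scrape just this output. Even that is a bit redundant, since I only need line 10 (<b>$0.33</b>), line 18 ($5.28), and line 29 ($0.77), but at this point I don't really care. I just want the program to run without taking 2-5 minutes. (30 seconds or less would be amazing.)

If I've left anything out of my post, let me know and I'll update it.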

<td style="width: 206px; padding-right: 10px" valign="top">
<table border="1" id="item_stats" style="margin-bottom: 10px">
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">
              Lowest
              <a alt="How are lowest prices calculated?" href="http://blog.yugiohprices.com/post/90183367316/lowest-card-price-is-now-picked-using-ebay-listings" target="_blank" title="How are lowest prices calculated?">(?)</a>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 2px">
<b>$0.33</b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">Highest</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 2px">
                $5.28
              </p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">
              Average
              <a alt="How are average prices calculated?" href="http://blog.yugiohprices.com/post/54460976914/how-are-average-prices-calculated" target="_blank" title="How are average prices calculated?">(?)</a>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 2px">
                $0.77
              </p>
</td>
</tr>
</table>
<table border="1" id="item_stats">
<tr style="height: 22px">
<td class="key" style="border: 1px solid #000; width: 80px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>
                  Shift
                </b>
<br/>
                (24 Hours)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b style="color: red">
                  -9.41%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 80px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>Shift</b>
<br/>
                (3 Days)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b style="color: red">
                  -2.53%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 80px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>Shift</b>
<br/>
                (1 Week)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b style="color: red">
                  -9.41%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>Shift</b>
<br/>
                (3 Weeks)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b>
                  0%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>Shift</b>
<br/>
                (30 Days)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b>
                  0%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>Shift</b>
<br/>
                (3 Months)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b>
                  0%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>Shift</b>
<br/>
                (6 Months)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b>
                  0%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" style="border: 1px solid #000; width: 94px">
<p style="margin-top: -6px; margin-bottom: -4px; margin-left: 0px; margin-right: -6px; font-weight: normal">
<b>Shift</b>
<br/>
                (1 Year)
              </p>
</td>
<td style="border: 1px solid #000; text-align: center">
<p style="margin: 6px; margin-bottom: 9px">
<b>
                  0%
                </b>
</p>
</td>
</tr>
<tr>
<td class="key" colspan="2" style="text-align: center; border: 1px solid #000">
<a href="/price_history/LDK2-ENY10?rarity=Common" target="_blank">View History</a>
</td>
</tr>
</table>
<br/>
<div align="center">
<script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- Yugioh Prices Skyscraper -->
<ins class="adsbygoogle" data-ad-client="ca-pub-7333610178228936" data-ad-slot="9136249004" style="display:inline-block;width:160px;height:600px"></ins>
<script>
            (adsbygoogle = window.adsbygoogle || []).push({});
          </script>
</div>
</td>

1 Answer:

Answer 0 (score: 1)

First of all, the odds are very low that parsing is the bottleneck here; please recheck this assumption.
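One way to recheck that assumption is to time the download and the parsing separately. The following is a minimal sketch, not a definitive profiler: `timed` and `profile_scrape` are hypothetical helper names introduced here for illustration, and `profile_scrape` assumes that `requests`, `bs4`, and `lxml` are installed, as in the question.

```python
from time import perf_counter


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print how long it took, return (result, seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsed = perf_counter() - start
    print("{}: {:.3f}s".format(label, elapsed))
    return result, elapsed


def profile_scrape(url):
    """Time each stage of the question's script separately (hypothetical helper)."""
    # Imported here so that timed() can be used without these packages installed.
    import requests
    from bs4 import BeautifulSoup

    response, fetch_s = timed("download", requests.get, url)
    soup, parse_s = timed("parse", BeautifulSoup, response.content, "lxml")
    _cells, find_s = timed("find_all", soup.find_all, "td")
    return fetch_s, parse_s, find_s
```

Calling `profile_scrape(link.format(int(time())))` with the `link` from the question prints one line per stage. If the "download" line dominates, the time is going to the network request rather than to Beautiful Soup.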

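The first logical thing to do would be to get rid of the z variable and simply break out of the loop once the desired information is found. This should have a significant impact on the execution time:

for thing in rawdata:
    thing_html = str(thing)  # avoid calling str() twice per iteration
    if "LDK2-ENY10" in thing_html and "$" in thing_html:
        print(thing)
        break

Alternatively, or in addition, you can avoid parsing the entire page source and parse only a part of the document with the SoupStrainer. Something along these lines:

from bs4 import BeautifulSoup, SoupStrainer

td_only = SoupStrainer("td")
soup = BeautifulSoup(r.content, "lxml", parse_only=td_only)

That said, I doubt the "parse only" approach would have a significant impact, given the relatively small size of the tree.

Another thing you can try is running the script with the PyPy interpreter instead of the regular CPython. You would need to switch from lxml to html.parser or html5lib (or you can install lxml from the fork), but after a couple of quick tests I can see that the performance is better than CPython + lxml.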
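Since the question only needs the three prices from the first matching block, the stop-at-first-match idea can be wrapped in a small function and followed by a price extraction step. This is a rough sketch under stated assumptions: `first_matching_cell` and `extract_prices` are hypothetical names, and the regular expression assumes prices are written like $0.33, as in the output shown in the question.

```python
import re

# Assumption: prices look like "$0.33" or "$5" (dollar sign, digits, optional decimals).
PRICE_RE = re.compile(r"\$\d+(?:\.\d+)?")


def first_matching_cell(cells, needles=("LDK2-ENY10", "$")):
    """Return the HTML of the first cell that contains every needle, or None."""
    for cell in cells:
        cell_html = str(cell)  # convert once per cell
        if all(needle in cell_html for needle in needles):
            return cell_html  # returning here stops the scan early
    return None


def extract_prices(cell_html):
    """Return all dollar amounts found in the given HTML, in document order."""
    return PRICE_RE.findall(cell_html)
```

With the question's variables this would be used as `extract_prices(first_matching_cell(rawdata))`. Note that `first_matching_cell` returns None when nothing matches, so a real script should check for that before extracting prices.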