Question

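I am currently iterating over a list of URLs and scraping data from each site as I visit it.

Sometimes a site takes an unreasonably long time to load. There is no error, but the page never finishes loading, so chromedriver / urlopen never returns and the script just hangs instead of continuing.

Dynamically testing for the presence of an element does not work in this case, because the page never fully loads, and the pages are not similar enough to test for a fixed element (not even common tags such as html or h1).

Basically I am looking for code that, if the page does not load, moves on to the next loop iteration after "x" seconds.

I am currently using Selenium (chromedriver) and BeautifulSoup (BS4).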

import re

from bs4 import BeautifulSoup

def get_emails_from_list(links):
    # `driver` is the Selenium WebDriver created elsewhere in the script.
    email = []
    for link in links:
        driver.get(link)
        html = driver.page_source
        try:
            raw = BeautifulSoup(html, 'html.parser').get_text()
        except Exception:
            # Fall back to the raw markup if parsing fails.
            raw = str(html)
        for em in re.findall(r'[\w\.-]+@[\w\.-]+', raw):
            if em not in email:
                # Append the single address, not the whole `emails` list,
                # so the duplicate check works and no flattening is needed.
                email.append(em)
    return email

Answer 1

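The best/usual way to do this is to set a timeout on the socket, or on whichever library you use for network I/O. So you should look into that first.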
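As a minimal sketch of the library-level timeout, assuming the page is fetched with `urlopen` (which the question mentions) rather than through the browser: `urllib.request.urlopen` accepts a `timeout` argument in seconds. The helper names `fetch_or_skip` and `fetch_all` are hypothetical and only for illustration. Note that this timeout applies to each blocking socket operation, not to the whole download. Selenium has its own equivalent, `driver.set_page_load_timeout(seconds)`, which makes `driver.get()` raise `TimeoutException`.

```python
import urllib.request


def fetch_or_skip(url, timeout_s=10.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        # URLError, socket.timeout and connection errors all derive from OSError.
        return None


def fetch_all(urls, timeout_s=10.0):
    """Fetch every URL, skipping the ones that hang or fail."""
    pages = {}
    for url in urls:
        body = fetch_or_skip(url, timeout_s)
        if body is None:
            continue  # timed out or failed: move on to the next URL
        pages[url] = body
    return pages
```

A slow site then costs at most roughly `timeout_s` per blocking operation instead of stalling the loop indefinitely.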

If that is not possible, you can use threads or signals. This example uses signals.

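import random
import signal
import time

class TimeoutError(RuntimeError):  # note: shadows the built-in TimeoutError
    pass

def handler(signum, frame):
    raise TimeoutError()

# SIGALRM is only available on Unix.
signal.signal(signal.SIGALRM, handler)

for i in range(5):
    try:
        signal.alarm(3)  # deliver SIGALRM in 3 seconds
        time.sleep(random.randint(1, 4))  # stand-in for the slow operation
        print('ok', i)
    except TimeoutError as ex:
        print('timeout', i)
    finally:
        signal.alarm(0)  # cancel any alarm that is still pending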

Update:

Apparently this does not run on Windows. According to {{3}}:

On Windows, `signal()` can only be called with `SIGABRT`, `SIGFPE`, `SIGILL`, `SIGINT`, `SIGSEGV`, or `SIGTERM`.
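Since `SIGALRM` is not available on Windows, the thread option mentioned in this answer is the portable alternative. The following is only a sketch, not part of the original answer: the helper `run_with_timeout` is a hypothetical name, and it relies on `concurrent.futures`, where `Future.result(timeout=...)` stops waiting after the given number of seconds. One important caveat: the worker thread is not killed, it is merely abandoned, so a call that never returns will keep a thread alive and can delay interpreter exit.

```python
import concurrent.futures
import time


def run_with_timeout(func, timeout_s, *args):
    """Run func(*args) in a worker thread.

    Returns (True, result) on success or (False, None) if timeout_s elapses first.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block on a stuck worker; it keeps running in the background.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    # Same shape as the signal example: time.sleep stands in for the slow call.
    for i, delay in enumerate([0.1, 1.0, 0.1]):
        ok, _ = run_with_timeout(time.sleep, 0.5, delay)
        print("ok" if ok else "timeout", i)
```

In the scraping loop, the slow call (for example the page fetch) would take the place of `time.sleep`, and a `(False, None)` result would trigger `continue`.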

Answer 2

Inside the loop, you can wait a few seconds and then break.

from numba import jitclass, float64

spec = [('n', float64),
        ('w', float64),
        ('a', float64)]

@jitclass(spec)
class foo(object):

    def __init__(self,n,w):

        self.n = n
        self.w = w

    def foo2(self):

        a = self.n*self.w

        return a + 1.
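One possible reading of "wait a few seconds and then break" is a polling loop with a deadline. The sketch below is an interpretation, not the answerer's code: `wait_until` is a hypothetical helper that checks a condition repeatedly and gives up once the time budget is spent. It only helps when the check itself returns quickly; it cannot interrupt a call that is already blocked.

```python
import time


def wait_until(predicate, timeout_s, poll_s=0.1):
    """Poll predicate() until it returns a truthy value or timeout_s elapses.

    Returns True if the predicate succeeded, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)  # wait a little before checking again
    return False
```

In the outer loop, a `False` result would be the cue to `continue` with the next link.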

Python: how to time out / abort and continue the loop iteration after "X" seconds

2 answers: