I'm trying to gather statistics about a web page. The page contains categories and products. I'm not downloading any information about the products, I'm only counting them.
The point is that I get a MemoryError, or just some text like "Script ends with code -1073741819" (the number is exact).
I tried printing the size of the category_urls variable after each loop, and it does not increase.
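What I mean by "printing the size" is essentially the following (the sys.getsizeof call is only for illustration, it is not in my script); as far as I understand, len() only counts the URLs and sys.getsizeof() only covers the list object itself, so neither really reflects the memory held by the process:

import sys

# illustrative only: what "printing the size" amounts to inside count_category
print(len(category_urls))            # number of collected URLs for this category
print(sys.getsizeof(category_urls))  # shallow size of the list object, not of the URL strings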
EDIT: The MemoryError is raised when the counted category is too big (around 60,000 URLs).
The main loop is simple:
for category in categories:
    count_category(category)
I thought the memory should be released after each iteration, but when I look at Task Manager -> Memory tab (python.exe), I can't see any release; the memory consumption just keeps rising.
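A minimal sketch of how I could check this from inside the script instead of Task Manager, using the standard tracemalloc module (illustration only, this is not part of my script):

import gc
import tracemalloc

tracemalloc.start()
for category in categories:
    count_category(category)
    gc.collect()  # force a collection so unreachable objects don't skew the reading
    current, peak = tracemalloc.get_traced_memory()
    print('after {}: current={} B, peak={} B'.format(category, current, peak))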
In case it helps with solving the problem:
def count_category(url):
    category_urls = list(get_category_urls(url))
    mLib.printToFile('database/count.txt', str(len(category_urls)))
    set_spracovanie_kategorie(url)  # This fnc just writes category url into text file
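For what it's worth, list(get_category_urls(url)) materialises every yielded URL string in memory just so len() can be taken; counting without building the list would look roughly like this (a sketch, not what I currently run):

def count_category(url):
    # consume the generator one URL at a time; nothing is kept in memory
    count = sum(1 for _ in get_category_urls(url))
    mLib.printToFile('database/count.txt', str(count))
    set_spracovanie_kategorie(url)  # writes the category url into a text file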
def get_category_urls(url):
    log('Getting category urls: {}'.format(url))
    urls = []
    next_url = url
    i = 1
    while next_url:
        root = load_root(next_url)
        urls.extend(get_products_on_page(root))
        for x in urls:
            if 'weballow' in x:
                yield x
        next_url = next_page(root, url)  # next_page is defined below
        # if next_url == False:
        #     return urls
        i += 1
def get_products_on_page(root):
    hrefs = root.xpath('//div[@id="product-contain"]//h2/a/@href')
    return hrefs
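One detail I'm not sure about myself: in get_category_urls the urls list is extended on every page, and the inner for x in urls loop then walks the whole accumulated list again, so earlier hrefs are yielded repeatedly and the list stays alive for the generator's entire lifetime. If that turns out to matter, yielding only the current page's hrefs would look roughly like this (untested sketch):

def get_category_urls(url):
    log('Getting category urls: {}'.format(url))
    next_url = url
    while next_url:
        root = load_root(next_url)
        # yield only the hrefs found on this page; nothing accumulates
        for x in get_products_on_page(root):
            if 'weballow' in x:
                yield x
        next_url = next_page(root, url)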
And the lxml loading functions:
class RedirectException(Exception):
    pass

def load_url(url):
    r = requests.get(url, allow_redirects=False)
    if r.status_code == 301:
        raise RedirectException
    html = r.text
    return html

def load_root(url):
    html = load_url(url)
    return etree.fromstring(html, etree.HTMLParser())
Next page:
def next_page(root, url):
    next = root.xpath('//a[@class="next"]/@href')
    if len(next) > 0:
        return urljoin(url, next[0])
    return False
Can you give me any suggestions about what to do?