I'm trying to gather statistics about a web page. The page contains categories and products. I'm not downloading any information about the products, I'm only counting them.
The point is that I get a MemoryError, or just some text like "Script ends with code -1073741819" (the number is exact).
I tried printing the size of the category_urls variable after each loop, and it does not increase.
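What I mean by "printing the size" is essentially the following (the sys.getsizeof call is only for illustration, it is not in my script); as far as I understand, len() only counts the URLs and sys.getsizeof() only covers the list object itself, so neither really reflects the memory held by the process:

import sys

# illustrative only: what "printing the size" amounts to inside count_category
print(len(category_urls))            # number of collected URLs for this category
print(sys.getsizeof(category_urls))  # shallow size of the list object, not of the URL strings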
EDIT: The MemoryError is raised when the counted category is too big (around 60,000 URLs).
The main loop is simple:
for category in categories:
    count_category(category)
I thought the memory should be released after each iteration, but when I look at Task Manager -> Memory tab (python.exe), I can't see any release; the memory consumption just keeps rising.
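A minimal sketch of how I could check this from inside the script instead of Task Manager, using the standard tracemalloc module (illustration only, this is not part of my script):

import gc
import tracemalloc

tracemalloc.start()
for category in categories:
    count_category(category)
    gc.collect()  # force a collection so unreachable objects don't skew the reading
    current, peak = tracemalloc.get_traced_memory()
    print('after {}: current={} B, peak={} B'.format(category, current, peak))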
In case it helps with solving the problem:
def count_category(url):
    category_urls = list(get_category_urls(url))
    mLib.printToFile('database/count.txt', str(len(category_urls)))
    set_spracovanie_kategorie(url)  # This fnc just writes category url into text file
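For what it's worth, list(get_category_urls(url)) materialises every yielded URL string in memory just so len() can be taken; counting without building the list would look roughly like this (a sketch, not what I currently run):

def count_category(url):
    # consume the generator one URL at a time; nothing is kept in memory
    count = sum(1 for _ in get_category_urls(url))
    mLib.printToFile('database/count.txt', str(count))
    set_spracovanie_kategorie(url)  # writes the category url into a text file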
def get_category_urls(url):
    log('Getting category urls: {}'.format(url))
    urls = []
    next_url = url
    i = 1
    while next_url:
        root = load_root(next_url)
        urls.extend(get_products_on_page(root))
        for x in urls:
            if 'weballow' in x:
                yield x
        next_url = next_page(root, url)  # next_page is defined below
        # if next_url == False:
        #     return urls
        i += 1
def get_products_on_page(root):
    hrefs = root.xpath('//div[@id="product-contain"]//h2/a/@href')
    return hrefs
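One detail I'm not sure about myself: in get_category_urls the urls list is extended on every page, and the inner for x in urls loop then walks the whole accumulated list again, so earlier hrefs are yielded repeatedly and the list stays alive for the generator's entire lifetime. If that turns out to matter, yielding only the current page's hrefs would look roughly like this (untested sketch):

def get_category_urls(url):
    log('Getting category urls: {}'.format(url))
    next_url = url
    while next_url:
        root = load_root(next_url)
        # yield only the hrefs found on this page; nothing accumulates
        for x in get_products_on_page(root):
            if 'weballow' in x:
                yield x
        next_url = next_page(root, url)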
And the lxml loading functions:
class RedirectException(Exception):
    pass

def load_url(url):
    r = requests.get(url, allow_redirects=False)
    if r.status_code == 301:
        raise RedirectException
    html = r.text
    return html

def load_root(url):
    html = load_url(url)
    return etree.fromstring(html, etree.HTMLParser())
Next page:
def next_page(root, url):
    next = root.xpath('//a[@class="next"]/@href')
    if len(next) > 0:
        return urljoin(url, next[0])
    return False
Can you give me any suggestions about what to do?