I have to merge two text files into one and then build a new list from them. The first file contains URLs, the other contains URL paths/folders that have to be applied to every URL. I am working with lists, and it is very slow because the result has roughly 200,000 items.
Sample:
urls.txt:
http://wwww.google.com
....
paths.txt:
/abc
/bce
....
Later, after the loop has finished, there should be a new list with:
http://wwww.google.com/abc
http://wwww.google.com/bce
Python code:
import re

URLS_TO_CHECK = [] # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = open("done.txt", "r").read().splitlines() # previously finished urls

    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i]) # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK.append(url) # <<< slow!
I have already read some other threads that suggest the map function and disabling gc, but I could not apply map to my program, and disabling gc did not really help.
Answer 0 (score: 1)
This approach takes advantage of the following:
def yield_urls():
    with open("paths.txt") as f:
        paths = f.readlines() # needed in every iteration and iterated over; a list is fine
    with open("done.txt") as f:
        done_urls = set(f.readlines()) # needed in every iteration and looked up; set lookup is O(1) vs O(n) for a list
    # resources are cleaned up when the with blocks exit

    with open("urls.txt", "r") as f:
        for url in f: # iterate over the file directly instead of building a big index list first, much quicker
            for subpath in paths:
                full_url = ''.join((url[7:], subpath)) # no regex means faster; string formatting may be quicker than join, you need to check
                # also, take care of trailing newlines in strings read from a file
                if full_url not in done_urls: # fast lookup in the set
                    yield full_url # yield instead of appending

# usage
for url in yield_urls():
    pass # do something with url
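As the comments warn, lines read from a file keep their trailing newline, and url[7:] also drops the "http://" prefix that the question's expected output keeps. A minimal sketch of the same generator with both points handled (this variant is my addition, assuming each line of urls.txt is already a complete base URL):

def yield_urls_stripped():
    with open("paths.txt") as f:
        paths = [line.strip() for line in f]          # e.g. "/abc", no trailing newline
    with open("done.txt") as f:
        done_urls = set(line.strip() for line in f)   # O(1) membership tests
    with open("urls.txt") as f:
        for line in f:
            base = line.strip()                       # e.g. "http://wwww.google.com"
            for subpath in paths:
                full_url = base + subpath             # "http://wwww.google.com/abc"
                if full_url not in done_urls:
                    yield full_url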
Answer 1 (score: 0)
import re

URLS_TO_CHECK = set(re.findall(r"http://(.+?)....", open("urls.txt", "r").read()))

for url in URLS_TO_CHECK:
    for path in paths:
        check_url(url + path)
This is probably a lot faster... and I think it does basically the same thing...
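The snippet leaves paths and check_url undefined and, unlike the question's code, does not skip the URLs already listed in done.txt. A rough self-contained sketch along the same lines (check_url is a hypothetical stand-in, and the regex is simplified to grab whole http:// tokens rather than reusing the question's pattern):

import re

def check_url(full_url): # hypothetical placeholder for the real check
    print(full_url)

paths = open("paths.txt").read().splitlines()
done = set(open("done.txt").read().splitlines())             # set for O(1) lookups

URLS_TO_CHECK = set(re.findall(r"http://\S+", open("urls.txt").read()))

for url in URLS_TO_CHECK:
    for path in paths:
        if url + path not in done:                           # honour the done.txt filter
            check_url(url + path)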
Answer 2 (score: 0)
Looking something up in a dictionary is faster than in a list (see Python: List vs Dict for look up table):
import re

URLS_TO_CHECK = {} # defined as global, needed later

def generate_list():
    urls = open("urls.txt", "r").read().splitlines()
    paths = open("paths.txt", "r").read().splitlines()
    done = dict([(l, True) for l in open("done.txt", "r").read().splitlines()]) # old done urls

    for i in range(len(urls)):
        for x in range(len(paths)):
            url = re.search('(http://(.+?)....)', urls[i]) # needed
            url = "%s%s" % (url.group(1), paths[x])
            if url not in URLS_TO_CHECK:
                if url not in done:
                    URLS_TO_CHECK[url] = True # result available via URLS_TO_CHECK.keys()
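To illustrate the linked claim, here is a small hedged benchmark sketch (timings vary by machine) comparing membership tests on a list against a dict; since the True values are never used, a plain set would express the same idea even more directly:

import timeit

items = ["http://example.com/%d" % i for i in range(200000)]
as_list = list(items)
as_dict = dict.fromkeys(items)        # set(items) behaves the same for lookups

probe = "http://example.com/199999"   # near the end: worst case for the list scan
print(timeit.timeit(lambda: probe in as_list, number=100))  # O(n) scan per test
print(timeit.timeit(lambda: probe in as_dict, number=100))  # O(1) hash lookup per test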