Question

I have the following script, which posts a search term to a form and retrieves the results:

import mechanize

url = "http://www.taliesin-arlein.net/names/search.php"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.open(url)
br.select_form(name="form")
br["search_surname"] = "*"
res = br.submit()
content = res.read()
with open("surnames.txt", "wb") as f:  # binary mode: res.read() returns raw bytes
    f.write(content)

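However, the rendered web page, and therefore this script, limits the search to 250 results. Is there any way to bypass this limit and retrieve all of the results?

Thanks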

Answer 1

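You can simply iterate over possible prefixes to get around the limit. There are about 270,000 names and a limit of 250 results per query, so you need to make at least 1080 requests. There are 26 letters in the alphabet, so if we assume an even distribution, we would need a little over 2 letters as a prefix (log(1080)/log(26) ≈ 2.14). The distribution is unlikely to be even, though (how many people have a surname starting with ZZ, after all).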
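The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the 270,000 and 250 figures come from the answer itself, and the constant and variable names are made up for illustration.

```python
import math

TOTAL_NAMES = 270000      # approximate number of surnames, as stated above
RESULTS_PER_QUERY = 250   # cap on the number of results per search
ALPHABET_SIZE = 26

# Lower bound on the number of requests, if every query returned a full page.
min_requests = math.ceil(TOTAL_NAMES / RESULTS_PER_QUERY)

# Prefix length needed if names were spread evenly over all letter combinations.
prefix_length = math.log(min_requests) / math.log(ALPHABET_SIZE)

print(min_requests)             # 1080
print(round(prefix_length, 2))  # 2.14
```

Since a prefix length has to be a whole number, an even distribution would already call for three-letter prefixes in places; a skewed one pushes some branches deeper still.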

To get around this, we use a modified depth-first search, as follows:

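import string
import time
import mechanize

def checkPrefix(prefix):
    # Return the list of surnames that start with this prefix.
    url = "http://www.taliesin-arlein.net/names/search.php"
    br = mechanize.Browser()
    br.set_handle_robots(False)  # ignore robots.txt, as in the question's script
    br.open(url)
    br.select_form(name="form")
    br["search_surname"] = prefix + '*'
    res = br.submit()
    content = res.read()
    return extractSurnames(content)

def extractSurnames(pageText):
    # Placeholder: write a function that extracts the surnames from the HTML
    # and returns them as a list.
    raise NotImplementedError("parse the result page here")


Q = list(string.ascii_lowercase)  # used as a stack, which makes the search depth-first
listOfSurnames = []
while Q:
    curPrefix = Q.pop()
    print(curPrefix)
    curSurnames = checkPrefix(curPrefix)
    if len(curSurnames) < 250:
        # Under the limit, so this is the complete set for this prefix.
        # Store the surnames (could also write them to a file).
        listOfSurnames += curSurnames
    else:
        # We clearly didn't get all of the names, so we need to subdivide further.
        # Caveat: a surname equal to curPrefix itself, or one whose next
        # character is not a-z, is not covered by these longer prefixes.
        Q += [curPrefix + x for x in string.ascii_lowercase]
    time.sleep(5)  # Sleep here to avoid overloading the server for other people.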

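So we query further wherever there are too many results to display, but we never query ZZZZ if fewer than 250 surnames start with ZZZ (or a shorter prefix). Without knowing how skewed the name distribution is, it is hard to estimate how long this will take, but a 5-second sleep multiplied by 1080 requests is about 1.5 hours, so you are probably looking at half a day at least, if not longer.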
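To see the subdivision behave as described without touching the real server, here is a small offline simulation. It is a sketch under stated assumptions: `fake_search` is a hypothetical stand-in for the search form that simply truncates matches at 250, `collect_all` is an illustrative name, and the sample surnames are synthetic.

```python
import string

LIMIT = 250  # result cap per query, as described above


def fake_search(names, prefix, limit=LIMIT):
    """Hypothetical stand-in for the form: return at most `limit` matches."""
    matches = [n for n in names if n.startswith(prefix)]
    return matches[:limit]


def collect_all(names, limit=LIMIT):
    """Subdivide prefixes until every query comes back under the cap."""
    stack = list(string.ascii_lowercase)
    found = []
    queries = 0
    while stack:
        prefix = stack.pop()
        results = fake_search(names, prefix, limit)
        queries += 1
        if len(results) < limit:
            found.extend(results)  # complete result set for this prefix
        else:
            # Capped result: try every one-letter extension of the prefix.
            stack.extend(prefix + c for c in string.ascii_lowercase)
    return found, queries


# Synthetic, deliberately skewed data: 300 names under "sm", 3 elsewhere.
letters = string.ascii_lowercase
sample = ["sm" + a + b for a in letters for b in letters][:300]
sample += ["adams", "baker", "zzyzx"]

found, queries = collect_all(sample)
print(len(found), queries)
```

Only the crowded "s" and "sm" branches are expanded here; every other prefix is answered by a single query, which is the saving the answer relies on.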

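Note: this could be made more efficient by declaring the browser globally, but whether that is appropriate depends on where this code is placed.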
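One way to act on this note is to create the browser once and pass it to the lookup function instead of building a new one for every prefix. A minimal sketch, assuming the form and field names used in the code above; `make_browser` and `fetch_prefix_page` are hypothetical helper names, and `mechanize` is imported inside `make_browser` only so that the rest of the block can be exercised without the library or the network.

```python
SEARCH_URL = "http://www.taliesin-arlein.net/names/search.php"


def make_browser():
    """Create a single mechanize browser to share across all queries."""
    import mechanize  # third-party; imported lazily in this sketch
    br = mechanize.Browser()
    br.set_handle_robots(False)  # same setting as the question's script
    return br


def fetch_prefix_page(br, prefix, url=SEARCH_URL):
    """Submit `prefix*` through an existing browser and return the raw page."""
    br.open(url)
    br.select_form(name="form")
    br["search_surname"] = prefix + "*"
    return br.submit().read()


# Intended use (needs mechanize and network access):
#   br = make_browser()
#   page = fetch_prefix_page(br, "sm")
```

Passing the browser as an argument avoids a module-level global while still reusing the same object, and it lets the function be tested with a stand-in browser.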
