Question

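I have about 100k URLs in a database and I want to check whether all of them are valid. I tried PHP with curl, but it is very slow and the script times out. Is there a better way to do this, perhaps with a shell script?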

So far I have tried this:

// By default get_headers uses a GET request to fetch the headers. If you
// want to send a HEAD request instead, you can do so using a stream context:
stream_context_set_default(
    array(
        'http' => array(
            'method' => 'HEAD'
        )
    )
);
$headers = get_headers('http://example.com');

It runs inside a for loop.

Answer 1

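Most of the time is spent waiting for servers to reply, so this problem lends itself to parallelising. Try splitting the list into several sublists and running scripts in parallel, each one processing a different sublist.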

Have a look at the split command for generating the sublists.

So you would end up with something like this:

#!/bin/bash
split -l 1000 urllist.txt tmpurl       # split bigfile into 1000 line subfiles called tmpurl*
for p in tmpurl*                       # for all tmpurl* files
do
   # Start a process to check the URLs in that list
   echo start checking file $p in background &    
done
wait                                   # till all are finished

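Where I have put "start checking file $p in background", you need to supply a simple PHP or shell script that takes a filename as a parameter (or reads from its stdin) and checks every URL in that file in a for loop, which is what you are already doing.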
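As a rough sketch of such a per-file checker, assuming bash and curl are available: the function name check_urls, the 10-second timeout and the "status URL" output format are illustrative choices of mine, not something from the script above.

```shell
#!/bin/bash
# check_urls FILE: print "<status> <url>" for every URL listed in FILE.
# Hypothetical helper; the name, timeout and output format are assumptions.
check_urls() {
    local url status
    # "|| [ -n "$url" ]" also handles a last line without a trailing newline
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue      # skip blank lines
        # -I sends a HEAD request, -s hides the progress meter,
        # -o /dev/null discards the headers, -w prints only the status code,
        # --max-time stops one slow server from stalling the whole sublist
        status=$(curl -I -s -o /dev/null --max-time 10 -w '%{http_code}' "$url")
        echo "$status $url"
    done < "$1"
}
```

The echo placeholder in the loop could then become something like `check_urls "$p" > "$p.result" &`, so each sublist writes its own result file. A status of 000 means curl got no HTTP response at all (DNS failure, timeout, refused connection).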

Extra information:

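Just for fun, I made a list of 1,000 URLs and fetched the headers from each of them with curl -I -s. Run sequentially, that took 4 minutes 19 seconds. When I used the script above to split the 1,000 URLs into sublists of 100 per file and started 10 processes, the whole test took 22 seconds, roughly a 12x speedup. Splitting the list into sublists of 50 URLs gave 20 processes, which all finished in 14 seconds. So, as I said, the problem parallelises readily.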
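With 100k URLs, small sublists translate into a large number of simultaneous processes. One way to cap that, which is a different technique from the split script and is offered only as a sketch, is xargs with -P. Note that -P is a GNU/BSD extension rather than POSIX, and that the function name check_parallel, the default of 10 jobs and the 10-second timeout are assumptions of mine.

```shell
#!/bin/bash
# check_parallel FILE [JOBS]: check the URLs in FILE with at most JOBS
# curl processes running at once. Hypothetical helper, not from the answer.
check_parallel() {
    local file=$1 jobs=${2:-10}
    # -P caps the number of concurrent processes,
    # -n 1 hands exactly one URL to each curl invocation.
    # curl's -w prints "<status> <url>" for each request.
    xargs -P "$jobs" -n 1 \
        curl -I -s -o /dev/null --max-time 10 \
             -w '%{http_code} %{url_effective}\n' \
        < "$file"
}
```

Called as `check_parallel urllist.txt 20 > results.txt`, this would keep 20 requests in flight without creating temporary files. Results arrive in completion order rather than list order, and xargs treats quotes and backslashes in its input specially, so URLs containing those characters would need extra care.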

Answer 2

You can use the mechanize Python module to visit the websites and get the responses from them.
