Question

I am trying to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in a Bash shell.

Main page with the list of accessions: http://signal.salk.edu/atg1001/download.php
Each accession links to a page (e.g. http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0, where Aa_0 is the accession ID) that contains three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]
I am only interested in the quality_variant_[accession] link (not the quality_variant_filtered_[accession] link), which leads to a .txt file with sequence data (e.g. http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)

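When I run the command below, the files of interest eventually show up in the output (but are not downloaded, because of the --spider argument), which shows that wget can follow the page's hyperlinks through to the files I want.

wget --spider --recursive "http://signal.salk.edu/atg1001/download.php"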
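
Since the spider run already prints the URLs of interest, one possible workaround is to filter its log rather than rely on wget's own accept filters. The sketch below is only an illustration: the function name extract_quality_variant_urls and the file name urls.txt are made up for this example, and the sample log lines are invented to mimic wget's "--timestamp--  URL" log format, not copied from a real run.

```shell
#!/usr/bin/env bash
# Read a wget log on stdin and print the unique quality_variant_*.txt URLs,
# skipping the quality_variant_filtered_* files.
extract_quality_variant_urls() {
  grep -oE 'http://[^[:space:]]+/quality_variant_[^[:space:]/]+\.txt' \
    | grep -v '/quality_variant_filtered_' \
    | sort -u
}

# Offline demonstration on made-up log lines (illustrative, not real wget output).
sample_log='--2020-01-01 00:00:00--  http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt
--2020-01-01 00:00:01--  http://signal.salk.edu/atg1001/data/Salk/quality_variant_filtered_Aa_0.txt
--2020-01-01 00:00:02--  http://signal.salk.edu/atg1001/data/Salk/unsequenced_Aa_0.txt
--2020-01-01 00:00:03--  http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt'
found=$(printf '%s\n' "$sample_log" | extract_quality_variant_urls)
printf '%s\n' "$found"

# Intended use (needs network access, so it is left commented out).
# wget writes its log to stderr, hence the 2>&1:
# wget --spider --recursive "http://signal.salk.edu/atg1001/download.php" 2>&1 \
#   | extract_quality_variant_urls > urls.txt
# wget -i urls.txt
```

This still crawls the whole site once in spider mode, so it may be slow, but the actual downloads are then limited to the URLs collected in urls.txt.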

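I have not let the command run long enough to confirm that the files of interest actually get downloaded, but the command below does begin to download the site recursively.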

# Arguments in brackets are optional; adding them does not change the behavior of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"

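However, whenever I try to apply a filter to pull out the .txt files of interest, whether with --accept-regex, --accept, or any of several other variants, I cannot get past the initial .php file.

# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"
# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.

I could make a list of the accession names and loop over it, substituting each name into the URL of the wget command, but I was hoping for a dynamic one-liner that would pull all the files of interest even as accession IDs are added over time.

Thanks!

Note: the data files of interest live in the directory http://signal.salk.edu/atg1001/data/Salk/, which also serves a page (PHP or static HTML) when that URL is visited. That URL cannot be used in the wget command: although the data files are stored there on the server, the HTML page contains no references to them and instead links to a set of .php files that I do not want.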
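
For reference, the loop over accession names mentioned above could be made dynamic by reading the IDs out of download.php itself. This is a rough sketch under one unverified assumption: that download.php links to each accession page with an href containing accession.php?id=<ID>, as in the example URL earlier. The function name quality_variant_urls and the sample HTML are made up for illustration.

```shell
#!/usr/bin/env bash
# ASSUMPTION: download.php links to accession pages as "accession.php?id=<ID>".
base="http://signal.salk.edu/atg1001/data/Salk"

# Read HTML on stdin and print one quality_variant URL per unique accession ID.
quality_variant_urls() {
  grep -oE 'accession\.php\?id=[A-Za-z0-9_.-]+' \
    | sed 's/.*id=//' \
    | sort -u \
    | while read -r id; do
        printf '%s/quality_variant_%s.txt\n' "$base" "$id"
      done
}

# Offline demonstration on a made-up HTML fragment.
sample='<a href="data/Salk/accession.php?id=Aa_0">Aa_0</a> <a href="data/Salk/accession.php?id=Abd_0">Abd_0</a>'
urls=$(printf '%s\n' "$sample" | quality_variant_urls)
printf '%s\n' "$urls"

# Intended use (needs network access, so it is left commented out):
# wget -qO- "http://signal.salk.edu/atg1001/download.php" \
#   | quality_variant_urls \
#   | wget -i -
```

Because the ID list is rebuilt from download.php on every run, newly added accessions would be picked up without editing the script, provided the link format assumed here holds.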

Using wget to recursively fetch .txt files linked from .php pages, but filters break the command

0 answers: