Question

I have the following problem. I want to download the text from this link:

http://www.ncbi.nlm.nih.gov/nuccore/NC_021206.1?report=fasta&log$=seqview&format=text

I tried wget and curl, but instead of the text file they download an HTML page. Is there a way around this?

Answer 1

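The problem is that the server does not return the actual text file. It returns a script that generates the text on the client side. I suspect this is a protection against automated leeching scripts, like the one you want to write.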

On the other hand, it is a rather weak protection, because the text they want to protect is simply loaded from another URL. In your case:

http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=498907917&db=nuccore&dopt=fasta&extrafeat=0&fmt_mask=0&maxplex=1&sendto=t&withmarkup=on&log$=seqview&maxdownloadsize=1000000

So, what you should do:

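# Fetch the HTML page ("whatever" stands for the page URL from the question)
wget "whatever" -O temp.html
# Extract the numeric ID from the ncbi_uidlist meta tag
id=$(grep ncbi_uidlist temp.html | sed -e 's/^.*ncbi_uidlist" content="//' -e 's/".*//')
# Request the text from the viewer URL (\$ keeps the dollar sign literal)
wget "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=${id}&db=nuccore&dopt=fasta&extrafeat=0&fmt_mask=0&maxplex=1&sendto=t&withmarkup=on&log\$=seqview&maxdownloadsize=1000000"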
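
The ID extraction and URL construction can also be wrapped in two small functions, which makes it easier to notice when the meta tag is missing. This is only a sketch: the function names extract_uid and viewer_url are made up for illustration, and it assumes the page still carries a meta tag of the form ncbi_uidlist" content="..." and that the viewer URL above still works.

```shell
# extract_uid: read an HTML page on stdin and print the value of the
# ncbi_uidlist meta tag (hypothetical helper, not part of any NCBI tool).
# Prints nothing if the tag is not found.
extract_uid() {
    sed -n 's/.*ncbi_uidlist" content="\([^"]*\)".*/\1/p' | head -n 1
}

# viewer_url: build the viewer URL from the answer for a given ID.
# Single quotes keep the "$" in "log$=seqview" literal.
viewer_url() {
    printf 'http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=%s&db=nuccore&dopt=fasta&extrafeat=0&fmt_mask=0&maxplex=1&sendto=t&withmarkup=on&log$=seqview&maxdownloadsize=1000000\n' "$1"
}

# Usage (needs network access; page_url is the page URL from the question):
#   id=$(wget -q -O - "$page_url" | extract_uid)
#   [ -n "$id" ] && wget -O sequence.fasta "$(viewer_url "$id")"
```

Checking that the ID is non-empty before the second request avoids silently downloading an error page when the layout of the HTML changes.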

Answer 2

Use lynx.

It has a -dump option, which provides the functionality you are looking for.
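
A minimal sketch of how -dump might be used follows. The function name dump_page and its argument order are made up for illustration. Whether the dumped text actually contains the sequence depends on what the server sends to lynx, which has not been verified here.

```shell
# dump_page: render a URL as plain text with lynx and save it to a file.
# Hypothetical helper: $1 is the URL, $2 is the output file.
dump_page() {
    url=$1
    out=$2
    # -dump writes the rendered page to stdout;
    # -nolist leaves out the list of links at the end of the dump.
    lynx -dump -nolist "$url" > "$out"
}

# Usage (needs lynx and network access; quote the URL because it contains "&" and "$"):
#   dump_page 'http://www.ncbi.nlm.nih.gov/nuccore/NC_021206.1?report=fasta&log$=seqview&format=text' sequence.txt
```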

wget / curl: downloading a file
