Question

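I have a text file that contains multiple URLs along with other information about each URL. How can I read the txt file and save only the URLs in an array so that I can download them? I want to use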

C = textscan(fileId, formatspec);

What should I put in formatspec to match the URLs?

Answer 1

This is not a job for textscan; you should use regular expressions. Regular expressions in MATLAB are described here. For URLs, see here or here for examples in other languages.

Here is an example in MATLAB:

% This string is obtained through textscan or something
str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% find URLs    
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');

C{:}

Result:

ans = 
    'http://www.example.com/index.php?query=test&otherStuf=info'
ans = 
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'

Note that this regular expression requires the URL either to include a protocol or to start with www. or ftp.. Something like example.com/universal_remote.cgi?redirect= will not match.
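To make that limitation concrete, here is a small sketch that runs the same pattern against a bare domain and against the same address with a leading www. The two test strings are made up for illustration.

```matlab
% Same pattern as in the example above.
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% Hypothetical inputs: identical except for the leading "www."
bare    = 'see example.com/universal_remote.cgi?redirect= for details';
withWww = 'see www.example.com/universal_remote.cgi?redirect= for details';

mBare = regexpi(bare,    pattern, 'match');   % no protocol, no www. or ftp. prefix
mWww  = regexpi(withWww, pattern, 'match');   % the leading www. triggers a match

fprintf('bare domain matches: %d\n', numel(mBare));
fprintf('www. domain matches: %d\n', numel(mWww));
```

If bare domains matter for your file, the pattern would have to be extended, at the cost of more false positives on ordinary words containing dots.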

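You could go on and make the regular expression cover more and more cases. Eventually, however, you will stumble upon the most important conclusion (made here, for example, which is where I got my regular expression from): given the full definition of what precisely constitutes a valid URL, no single regular expression can always match every valid URL. That is, there are valid URLs you can think of that are not captured by any of the regular expressions shown.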

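Keep in mind, though, that this last statement is more theoretical than practical: the URLs that fail to match are valid, but they are not encountered very often in practice :) In other words, if your URLs have a fairly standard form, the regular expression I gave you pretty much has you covered.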
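Putting this together for the original question, a minimal sketch could read the whole file with fileread and apply the pattern once, which yields a flat cell array of URLs. The sample file written below, including its name and contents, is an assumption made only so the snippet runs on its own; in practice you would point fileread at your own file.

```matlab
% Write a small sample file (hypothetical contents, for illustration only).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    'id1 http://www.example.com/a.html 2013-01-01', ...
    'id2 ftp://localhost/b.py some note');
fclose(fid);

% Same pattern as in the answer above.
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

txt  = fileread(fname);                  % whole file as one char vector
urls = regexpi(txt, pattern, 'match');   % 1-by-N cell array of matched URLs

delete(fname);                           % clean up the sample file
disp(urls(:));                           % one URL per row
```

Because regexpi is applied to a single char vector here, the result is already one flat cell array rather than a cell of cells as in the per-line example.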

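Now, I fooled around a bit with pm89's Java suggestion. As I suspected, it is an order of magnitude slower than the regular expression, because you introduce another "sticky layer" into the code (in my timings it was about 40 times slower, not counting the imports). Here is my version: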

import java.net.URL;
import java.net.MalformedURLException;

str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% Attempt to convert each item into an URL.  
for ii = 1:numel(str)    
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})

        catch ME
            % rethrow any non-url related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                throw(ME);
            end

        end
    end
end

Result:

url =
    'http://www.example.com/index.php?query=test&otherStuf=info'
url =
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'

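I am not that familiar with java.net.URL, but apparently it is also unable to find URLs that lack a leading protocol or a standard domain (for example, example.com/path/to/page).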

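This snippet can undoubtedly be improved, but I would urge you to consider why you would go to the trouble for a solution that is longer, inherently slower, and uglier :)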
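If you want to check the speed difference on your own machine, a rough sketch along these lines may help. The repetition count nReps and the sample string are arbitrary choices, tic/toc is only a coarse timer, and the ratio you see will depend on your MATLAB version and hardware.

```matlab
% Hypothetical sample line and repetition count, chosen only for illustration.
str   = 'pre-URL garbage http://www.example.com/index.php?query=test more stuff here';
nReps = 200;

pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% Regular-expression approach
tic
for r = 1:nReps
    viaRegex = regexpi(str, pattern, 'match');
end
tRegex = toc;

% java.net.URL approach: try every whitespace-separated token
tic
for r = 1:nReps
    tokens  = strsplit(str);
    viaJava = {};
    for k = 1:numel(tokens)
        try
            u = java.net.URL(tokens{k});
            viaJava{end+1} = char(u.toString()); %#ok<SAGROW>
        catch
            % token is not a URL that java.net.URL accepts; skip it
        end
    end
end
tJava = toc;

fprintf('regex: %.4f s, java.net.URL: %.4f s\n', tRegex, tJava);
```

The bare catch swallows every error for brevity; the version above, which rethrows anything other than MalformedURLException, is the safer pattern.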

Answer 2

I suspect you could use java.net.URL, following this answer.

To implement the same code in MATLAB:

First, read the file into a string using fileread, for example:

str = fileread('Sample.txt');

Then use strsplit to split the text at whitespace:

spl_str = strsplit(str);

Finally, use java.net.URL to detect the URLs:

for k = 1:length(spl_str)
    try
       url = java.net.URL(spl_str{k})
       % Store or save the URL contents here
    catch e
       % it's not a URL.
    end
end
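One way to fill in the "store" step is to mark which tokens java.net.URL accepts and then index the token list with that mask, which gives the array of URLs the question asks for. This is only a sketch: the input text below is made up and stands in for the output of fileread.

```matlab
% Hypothetical file contents; in practice use str = fileread('Sample.txt');
str = sprintf('id1 http://www.example.com/a.html note\nid2 ftp://localhost/b.py 42');

spl_str = strsplit(str);            % default delimiter is whitespace, including newlines
isUrl   = false(size(spl_str));     % logical mask, one entry per token

for k = 1:numel(spl_str)
    try
        java.net.URL(spl_str{k});   % throws if the token is not a well-formed URL
        isUrl(k) = true;
    catch
        % not a URL; leave the mask entry false
    end
end

urls = spl_str(isUrl);              % cell array containing only the URLs
disp(urls(:));
```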

You can use urlwrite to write the contents of a URL to a file. First, though, convert the URL object returned by java.net.URL to char:

url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
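To download every URL rather than a single one, the same call can go in a loop that gives each download its own file name. The sketch below makes several assumptions: the urls list is a placeholder, the page_NNN.html naming scheme is invented, and the doDownload flag is off by default so that nothing touches the network until you enable it. In newer MATLAB releases websave is available as an alternative to urlwrite.

```matlab
% Placeholder list; in practice this is the cell array of URLs found earlier.
urls = {'http://www.example.com/index.html', 'http://www.example.com/about.html'};

outDir     = tempdir;    % assumed output folder
doDownload = false;      % set to true to actually fetch the pages

targets = cell(size(urls));
for k = 1:numel(urls)
    % Invented naming scheme: page_001.html, page_002.html, ...
    targets{k} = fullfile(outDir, sprintf('page_%03d.html', k));
    if doDownload
        try
            urlwrite(urls{k}, targets{k});
        catch err
            % Keep going if one download fails
            warning('Download failed for %s: %s', urls{k}, err.message);
        end
    end
end

disp(targets(:));
```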

Hope it helps.
