Question

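I want to validate a long list of URL strings, but some of them contain characters with diacritics (umlauts and accents) such as ä, à, è, ö.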

Is there a way to configure the Apache Commons UrlValidator to accept these characters?

This test fails (note the ã):

@Test
public void urlValidatorShouldPassWithUmlaut()
{
    // Given
    org.apache.commons.validator.routines.UrlValidator validator;
    validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );

    // When
    String url = "http://dbpedia.org/resource/São_Paulo";

    // Then
    assertThat( validator.isValid( url ), is( true ) );
}

This test passes (ã replaced with a):

@Test
public void urlValidatorShouldPassWithUmlaut()
{
    // Given
    org.apache.commons.validator.routines.UrlValidator validator;
    validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );

    // When
    String url = "http://dbpedia.org/resource/Sao_Paulo";

    // Then
    assertThat( validator.isValid( url ), is( true ) );
}

Software version:

<dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.4.0</version>
</dependency>

Update

validator.isValid( IDN.toASCII(url) ) also fails, because IDN.toASCII(url) does something I do not yet understand: for example, it converts http://dbpedia.org/resource/São_Paulo to http://dbpedia.xn--org/resource/so_paulo-w1b, which is still invalid according to UrlValidator.

Answer 1

You have to encode the part that contains the umlaut before validating it:

import org.apache.commons.validator.routines.UrlValidator;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UmlautUrlTest {
    public static void main(String[] args) {
        String url = "http://dbpedia.org/resource/";
        String umlautPart = "São_Paulo";
        try {
            String[] schemes = {"http", "https"};
            UrlValidator v = new UrlValidator(schemes, UrlValidator.ALLOW_ALL_SCHEMES);
            // Percent-encode only the path segment that contains the non-ASCII character.
            String encodedUrl = URLEncoder.encode(umlautPart, "UTF-8");
            System.out.println(v.isValid(url + encodedUrl));
            System.out.println(encodedUrl);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this is not expected to happen.
            e.printStackTrace();
        }
    }
}

The output is:

true
S%C3%A3o_Paulo

Edit

You can use this function to encode the whole URL before validating it.

public static String encodeUrl(String url) {
    // Split "scheme://rest" into the scheme and everything after it.
    String[] parts = url.split("://", 2);
    String protocol = parts[0];
    String restOfUrl = parts[1];

    // The host ends at the first "/"; the remaining tokens are path segments.
    String[] segments = restOfUrl.split("/");
    String[] hostLabels = segments[0].split("\\.");

    try {
        // Encode each host label and rejoin the labels with ".".
        StringBuilder host = new StringBuilder();
        for (int i = 0; i < hostLabels.length; i++) {
            if (i > 0) {
                host.append('.');
            }
            host.append(URLEncoder.encode(hostLabels[i], "UTF-8"));
        }

        // Encode each path segment and rejoin the segments with "/".
        StringBuilder remainingPart = new StringBuilder();
        for (int i = 1; i < segments.length; i++) {
            remainingPart.append('/').append(URLEncoder.encode(segments[i], "UTF-8"));
        }

        return protocol + "://" + host + remainingPart;
    } catch (UnsupportedEncodingException e) {
        // UTF-8 is always available, so this is not expected to happen.
        throw new IllegalStateException(e);
    }
}

And use it in the test: validator.isValid(encodeUrl(url))
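As a possible alternative to splitting the string by hand, here is a minimal sketch that uses only JDK classes. It assumes that IDN.toASCII is meant for host names rather than whole URLs (which would explain the odd result in the question's update), so it applies IDN.toASCII to the host only and lets java.net.URI percent-encode the rest. The class name UrlAsciiEncoder and the method name toAsciiUrl are made up for this illustration. It also assumes the input is not already percent-encoded, because the multi-argument URI constructor would escape an existing "%" again. I have not run its output through UrlValidator 1.4.0.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of Commons Validator.
final class UrlAsciiEncoder {

    private UrlAsciiEncoder() {
    }

    /** Returns an ASCII-only form of the given URL string. */
    static String toAsciiUrl(String url) {
        try {
            // java.net.URL parses leniently and does not reject non-ASCII characters.
            // (This constructor is deprecated in recent JDKs but still available.)
            URL parsed = new URL(url);

            // Punycode applies to host labels only, never to the path.
            String asciiHost = IDN.toASCII(parsed.getHost());

            // The multi-argument constructor quotes illegal ASCII characters
            // in each component.
            URI uri = new URI(parsed.getProtocol(), parsed.getUserInfo(), asciiHost,
                    parsed.getPort(), parsed.getPath(), parsed.getQuery(), parsed.getRef());

            // toASCIIString() percent-encodes the remaining non-ASCII characters as UTF-8.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode URL: " + url, e);
        }
    }
}
```

With this sketch, toAsciiUrl("http://dbpedia.org/resource/São_Paulo") gives http://dbpedia.org/resource/S%C3%A3o_Paulo, the same encoded path as in the output above, and the result could then be passed to validator.isValid(...).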

Answer 2

While reading this question (Regex: what is InCombiningDiacriticalMarks?), I found another partial solution:

    public static boolean removeAccentsAndValidateUrl( String url )
    {
        String normalizedUrl = Normalizer.normalize( url, Normalizer.Form.NFD );
        Pattern accentsPattern = Pattern.compile( "\\p{InCombiningDiacriticalMarks}+" );
        String urlWithoutAccents = accentsPattern.matcher( normalizedUrl ).replaceAll( "" );
        String[] schemes = {"http", "https"};
        long options = UrlValidator.ALLOW_ALL_SCHEMES;
        UrlValidator urlValidator = new UrlValidator( schemes, options );
        return urlValidator.isValid(urlWithoutAccents);
    }

However, it turns out that UrlValidator also fails on other characters, among them the "–" character.

For example, the following URL fails validation:

http://dbpedia.org/resource/PENTA_–_Pena_Transportes_Aereos
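To illustrate why this is only a partial solution, here is a small JDK-only sketch. The class AccentStripDemo and its methods are made up for this illustration. It applies the same NFD normalization and InCombiningDiacriticalMarks pattern as above: "ã" decomposes into "a" plus a combining tilde, which the pattern removes, whereas "–" has no decomposition and is not a combining mark, so it passes through unchanged and the string is still not pure ASCII.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Commons Validator.
final class AccentStripDemo {

    private AccentStripDemo() {
    }

    /** Decomposes accented letters (NFD) and drops the combining marks. */
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    /** Returns true if every character is in the US-ASCII range. */
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }
}
```

For example, stripAccents("São_Paulo") returns "Sao_Paulo", but stripAccents("PENTA_–_Pena") returns the input unchanged, and isAscii on that result is false. Note also that this approach changes the URL itself, so the stripped string may no longer point to the same resource.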

Apache Commons UrlValidator - configure to allow umlaut characters
