Question

I need to read the HTML of a web page, find the links and images in it, and then rename them. This is what I have so far:

reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    regex = "<a[^>]*href=(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)</a>";
    final Pattern pa = Pattern.compile(regex, Pattern.DOTALL);
    final Matcher ma = pa.matcher(s);
    if (ma.find()) {
        String newlink = path + "1-2.html";
        //replace the link in href with newlink, how can i do this?
    }
    html.append(line).append("\r\n");
}

How do I implement the part marked by the comment?

Answer 1

Parsing HTML with regular expressions can be difficult and unreliable. For something like this, it is better to use XPath and DOM manipulation.
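As a rough illustration of the DOM/XPath route, here is a minimal sketch using only the JDK's built-in XML APIs. It assumes the input is well-formed XHTML; real-world HTML usually is not, and would first need a lenient HTML parser. The class name `DomLinkRewriter`, the method `rewrite`, and the two replacement parameters are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomLinkRewriter {
    // Hypothetical helper: sets every a/@href to newHref and every img/@src to newSrc.
    // Assumption: xhtml is well-formed XML without namespace declarations.
    static String rewrite(String xhtml, String newHref, String newSrc) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select all links that have an href and all images that have a src.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a[@href] | //img[@src]", doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                if (e.getTagName().equals("a")) {
                    e.setAttribute("href", newHref);
                } else {
                    e.setAttribute("src", newSrc);
                }
            }

            // Serialize the modified DOM back to a string.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process document", e);
        }
    }
}
```

Because the attributes are changed on parsed nodes, quoting style, attribute order, and multiple links per line stop being a concern.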

Answer 2

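The alternative has already been mentioned. If you stay with regular expressions:

Matcher supports a "replace all" that works with a StringBuffer.
Part of the matched text has to be carried over into the replacement, so all of the matched text must be available in capturing groups: ma.group(1), ma.group(2), ma.group(3), and so on.
DOTALL makes . match line terminators as well, which is not needed here because readLine already strips the line ending.
There may be more than one link per line.
The sample code calls matcher(s) instead of matcher(line).

So the code uses Matcher.appendReplacement and Matcher.appendTail.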

StringBuffer html = new StringBuffer();
reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
regex = "(<a[^>]*href=)(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)(</a>)";
final Pattern pa = Pattern.compile(regex);
while ((line = reader.readLine()) != null) {
    final Matcher ma = pa.matcher(line);
    while (ma.find()) {
        String newlink = path + "1-2.html";
        // Rebuild the match from its groups, substituting newlink for the href value.
        ma.appendReplacement(html, ma.group(1) /* a href */ + ...);
    }
    // Copies the text after the last match, or the whole line if nothing matched,
    // so the line must not be appended a second time.
    ma.appendTail(html);
    html.append("\r\n");
}
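For reference, a self-contained version of the per-line replacement could look like the sketch below. The names `HrefRewriter` and `rewriteLine` are hypothetical, and the pattern is a simplified three-group variant rather than the one above: it also captures the rest of the opening tag so that other attributes survive. The replacement is passed through Matcher.quoteReplacement because appendReplacement treats `$` and `\` in the replacement string specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRewriter {
    // Group 1: the tag text up to and including "href=".
    // Group 2: the old href value, with its quotes if it has any.
    // Group 3: the rest of the opening tag, the link text, and the closing </a>.
    private static final Pattern LINK = Pattern.compile(
            "(<a\\b[^>]*?\\bhref\\s*=\\s*)(\"[^\"]*\"|'[^']*'|[^\\s>]+)([^>]*>.*?</a>)",
            Pattern.CASE_INSENSITIVE);

    // Replaces the href value of every link on the line with newLink.
    static String rewriteLine(String line, String newLink) {
        Matcher ma = LINK.matcher(line);
        StringBuffer out = new StringBuffer();
        while (ma.find()) {
            String replacement = ma.group(1) + '"' + newLink + '"' + ma.group(3);
            // quoteReplacement keeps '$' and '\' in the new link literal.
            ma.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // Append whatever follows the last match (or the whole line if none matched).
        ma.appendTail(out);
        return out.toString();
    }
}
```

In the reading loop this would be used as `html.append(HrefRewriter.rewriteLine(line, path + "1-2.html")).append("\r\n");`. It still shares the general weakness of regex-based HTML handling: a link whose tag or text spans several lines will not be matched.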

Java: replacing the content of links

2 answers: