I need to get the content of an <a> HTML tag that has a certain CSS class name.
The CSS class that I need to find is: whtbigheader
What I have done so far is this:
String content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' style='color:#FFFFFF;' HM=1>need to get this value</A>";
Pattern p = Pattern.compile("<A.+?class\\s*?=[whtbigheader]['\"]?([^ '\"]+).*?>(.*?)</A>");
Matcher m = p.matcher(content);
if (m.find()) {
System.out.println("found");
System.out.println(m.group(1));
}
else {
System.out.println("not found");
}
The expected value is: need to get this value
Any ideas how to find it?
Answer 0 (score: 4)
I hate using regex for HTML parsing, which is why this solution may not be what the asker was hoping for:
Use Jsoup to achieve this:
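import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
String html = content; // your HTML code (here, the string from the question)
Document doc = Jsoup.parse(html);
Elements elements = doc.select(".whtbigheader"); // that's it: this contains all elements whose class is whtbigheader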
To make sure you only get a tags:
Elements elements=doc.select("a").select(".whtbigheader");
To get the text, just iterate over the elements and read the text of each one:
for(Element element : elements){
System.out.println(element.text());
}
Download link:
Download Jsoup 1.8.2 here :).
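Answer 1 (score: 1)
Use a non-capturing group instead of square brackets to match the word. Square brackets define a character class, which matches a single character from the set rather than the whole word.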
Pattern p = Pattern.compile("(?i)<A.+?class\\s*?=(['\"])?(?:whtbigheader)\\1[^>]*>(.*?)</A>");
Matcher m = p.matcher(content);
if (m.find()) {
System.out.println("found");
System.out.println(m.group(2));
}
else {
System.out.println("not found");
}
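To make the difference concrete, here is a minimal sketch of how a character class behaves compared with a literal word. The class name CharClassDemo and the helper fullMatch are made up for this illustration; only java.util.regex.Pattern from the standard library is used.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: true only if the WHOLE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // A character class matches exactly one character taken from the set.
        System.out.println(fullMatch("[whtbigheader]", "w"));
        System.out.println(fullMatch("[whtbigheader]", "whtbigheader"));
        // A literal, optionally wrapped in a non-capturing group, matches the word.
        System.out.println(fullMatch("(?:whtbigheader)", "whtbigheader"));
    }
}
```

The first and third checks succeed and the second does not, which is why the pattern in the question never reaches the class name.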
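Answer 2 (score: 1)
A parser is a more robust way to extract information from HTML. However, in this case it is possible to use a regular expression to get what you want (assuming you will never have nested anchor tags; if you do, you may want to sanity-check your file, and you will definitely need a parser).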
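You can use the following regular expression (with the case-insensitive flag):
"<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>"
You want to extract the second capture group of the match, like this: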
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
static final Pattern ANCHOR_PATTERN = Pattern.compile(
"<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>",
Pattern.CASE_INSENSITIVE
);
public static String getAnchorContents( final String html ){
final Matcher matcher = ANCHOR_PATTERN.matcher( html );
if ( matcher.find() ){
return matcher.group(2);
}
return null;
}
public static void main( final String[] args ){
final String[] tests = {
"<a class=whtbigheader>test</a>",
"<a class=\"whtbigheader\">test</a>",
"<a class='whtbigheader'>test</a>",
"<a class =whtbigheader>test</a>",
"<a class =\"whtbigheader\">test</a>",
"<a class ='whtbigheader'>test</a>",
"<a class= whtbigheader>test</a>",
"<a class= \"whtbigheader\">test</a>",
"<a class= 'whtbigheader'>test</a>",
"<a class = whtbigheader>test</a>",
"<a class\t=\r\n\"whtbigheader\">test</a>",
"<a class =\t'whtbigheader'>test</a>",
"<a class=\"otherclass whtbigheader\">test</a>",
"<a class=\"whtbigheader otherclass\">test</a>",
"<a class=\"whtbigheader2 whtbigheader\">test</a>",
"<a class=\"otherclass whtbigheader otherotherclass\">test</a>",
"<a class=whtbigheader href=''>test</a>",
};
int successes = 0;
int failures = 0;
for ( final String test : tests )
{
final String contents = getAnchorContents( test );
if ( "test".equals( contents ) )
successes++;
else
{
System.err.println( test + " => " + contents );
failures++;
}
}
final String[] failingTests = {
"<a class=whtbigheader2>test</a>",
"<a class=awhtbigheader>test</a>",
"<a class=whtbigheader-other>test</a>",
"<a class='whtbigheader2'>test</a>",
"<a class='awhtbigheader'>test</a>",
"<a class='whtbigheader-other'>test</a>",
"<a class=otherclass whtbigheader>test</a>",
"<a class='otherclass' whtbigheader='value'>test</a>",
"<a class='otherclass' id='whtbigheader'>test</a>",
"<a><aclass='whtbigheader'>test</aclass></a>",
"<a aclass='whtbigheader'>test</a>",
"<a class='whtbigheader\"'>test</a>",
"<ab class='whtbigheader'><a>test</a></ab>",
};
for ( final String test : failingTests )
{
final String contents = getAnchorContents( test );
if ( contents == null )
successes++;
else
{
System.err.println( test + " => " + contents );
failures++;
}
}
System.out.println( "Successful tests: " + successes );
System.out.println( "Failed tests: " + failures );
}
}
Answer 3 (score: 0)
You can use the following regular expression:
/<a[^>]*class=\s?['"]\s?whtbigheader\s?['"][^>]*>(.*?)</a>/i
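Note that if you only want the content of an a tag with a certain class, you don't need any extra regex inside the tag; a[^>]*class='whtbigheader'[^>]* is enough for the task:
[^>]* matches any run of characters other than >.
Also, you need to use the i modifier (IGNORE CASE) to ignore case!
Also, regex is not the right way to parse (?:X|H)TML documents. Consider using a proper parser.
Note that if you put the regex inside a quoted string, you need to escape the quotes next to the class name.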
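As a rough sketch of how this answer's regex could be written in Java: the double quote inside the character class is escaped because the regex sits in a string literal, and Pattern.CASE_INSENSITIVE plays the role of the i modifier. The class name AnchorByClass and the method firstAnchorText are assumptions made for this example, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorByClass {
    // Each regex backslash is doubled and the double quote is written as \"
    // because the pattern lives inside a Java string literal.
    static final Pattern ANCHOR = Pattern.compile(
        "<a[^>]*class=\\s?['\"]\\s?whtbigheader\\s?['\"][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE);

    // Returns the text of the first matching anchor, or null if none matches.
    static String firstAnchorText(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' "
            + "style='color:#FFFFFF;' HM=1>need to get this value</A>";
        System.out.println(firstAnchorText(content)); // prints: need to get this value
    }
}
```

This variant only handles a quoted class attribute whose value is exactly whtbigheader; anchors with several classes or an unquoted value would need the longer pattern from Answer 2 or a parser.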