I am using Java and Jsoup to extract the content of div tags. I only need to extract the numbers.
String html = "";
Document document = Jsoup.parse(html);
Elements divs = document.select("div");
for (Element div : divs) {
    System.out.println(div.ownText());
}
The output looks like this:
Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1–2, 4–5, 7, 9, 12, 13, 32, 35,
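My question is: how do I extract only the numeric content of the div? Every number I need is preceded by a comma. How can I do this with a regular expression? Thanks.
Update: how do I extract both numbers and Roman numerals?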
Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1–2, 4–5, 7, 9, 12, 13, 32, 35, V, VI, IX,
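This post differs from the linked question above because my question also requires extracting Roman numerals.
Answer 0 (score: 0)
You can use this regular expression: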
\b(\d+(-\d+)?|(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})))\b
Demo: https://regex101.com/r/rW1mY1/3
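As a minimal sketch of how the expression above might be applied in Java, the following collects every match into a list. The class name NumberAndRomanExtractor and the method extract are made up for illustration, and the sample string assumes plain ASCII hyphens in the ranges (an en dash, as in "1–2", would not match the "-" in the pattern).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class NumberAndRomanExtractor {
    // The four Roman-numeral alternatives from the regex above. Each alternative
    // requires at least one symbol (thousands, hundreds, tens, or units), so the
    // Roman part cannot match an empty string.
    private static final String ROMAN =
          "M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
        + "|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
        + "|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})"
        + "|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})";

    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern TOKEN =
        Pattern.compile("\\b(\\d+(-\\d+)?|(" + ROMAN + "))\\b");

    // Returns every number, number range, or Roman numeral found in the text.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Adidas, 45-46 Nike, 25 shoes, phone, keyboard, "
                    + "1-2, 4-5, 7, 9, 12, 13, 32, 35, V, VI, IX,";
        System.out.println(extract(text));
    }
}
```

With the sample string above this should print [45-46, 25, 1-2, 4-5, 7, 9, 12, 13, 32, 35, V, VI, IX]. The pattern is case-sensitive, so lowercase letters inside ordinary words are never treated as Roman numerals.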
\b matches a word boundary.
(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))
is the Roman numeral validator. I took it from here: How do you match only valid roman numerals with a regular expression?
\d+(-\d+)? matches a number and an optional number range.
Answer 1 (score: 0)
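This regular expression matches plain numbers, hyphenated number ranges such as 45-46, and Roman numerals.
Regular expression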
\b(?:\d+(?:-\d+)?|(?=[MCDLXVI]+\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\b
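Note that this is the raw regular expression. In many languages, Java among them, the pattern is written as a string literal, so every \ has to be replaced with \\ for it to work.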
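To illustrate the escaping note, here is a small sketch using a deliberately short pattern, \b\d+\b, rather than the full expression. The class EscapeDemo and the method isNumber are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing how a raw regex is written in Java source.
class EscapeDemo {
    // Raw regex: \b\d+\b
    // Java string literal: each backslash is doubled.
    static final String NUMBER = "\\b\\d+\\b";

    // True when the whole input is one run of digits.
    static boolean isNumber(String s) {
        return Pattern.matches(NUMBER, s);
    }

    public static void main(String[] args) {
        // The string held in memory contains single backslashes again.
        System.out.println(NUMBER);           // prints \b\d+\b
        System.out.println(NUMBER.length());  // prints 7
        System.out.println(isNumber("35"));   // prints true
        System.out.println(isNumber("VI"));   // prints false
    }
}
```

The doubling only concerns Java source code. A pattern read from a file or from user input at run time is used as-is.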
NODE EXPLANATION
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
)? end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[MCDLXVI]+ any character of: 'M', 'C', 'D', 'L',
'X', 'V', 'I' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
M{0,4} 'M' (between 0 and 4 times (matching the
most amount possible))
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
CM 'CM'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
CD 'CD'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
D? 'D' (optional (matching the most
amount possible))
----------------------------------------------------------------------
C{0,3} 'C' (between 0 and 3 times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
XC 'XC'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
XL 'XL'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
L? 'L' (optional (matching the most
amount possible))
----------------------------------------------------------------------
X{0,3} 'X' (between 0 and 3 times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
IX 'IX'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
IV 'IV'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
V? 'V' (optional (matching the most
amount possible))
----------------------------------------------------------------------
I{0,3} 'I' (between 0 and 3 times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
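One consequence of the breakdown above: every part of the Roman-numeral branch is optional, so for a word made only of the letters M, C, D, L, X, V, I that is not a valid numeral (for example IIII), the look-ahead succeeds and the branch can fall back to an empty match. The sketch below shows that behaviour and one possible guard, skipping empty matches. The class EmptyMatchGuard and its methods are hypothetical, and the CASE_INSENSITIVE flag is left out here on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one explained above.
class EmptyMatchGuard {
    static final Pattern TOKEN = Pattern.compile(
        "\\b(?:\\d+(?:-\\d+)?"
        + "|(?=[MCDLXVI]+\\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\\b");

    // Every match reported by find(), including zero-length ones.
    static List<String> allMatches(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Same scan, but zero-length matches are dropped.
    static List<String> nonEmptyMatches(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "IIII, VI, 12";
        System.out.println(allMatches(text).size());   // 3: an empty match, VI, 12
        System.out.println(nonEmptyMatches(text));     // [VI, 12]
    }
}
```

With CASE_INSENSITIVE enabled, ordinary lowercase words built from those seven letters can trigger the same empty match, so a guard like this may be worth keeping when the input is free text.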
Live demo
Sample text
Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1-2, 4-5, 7, 9, 12, 13, 32, 35, V, VI, IX
Java code example
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
    public static void main(String[] asd) {
        String sourcestring = "Adidas, 45-46 Nike, 25 shoes, phone, keyboard, 1-2, 4-5, 7, 9, 12, 13, 32, 35, V, VI, IX";
        // Backslashes are doubled because the regex is a Java string literal.
        Pattern re = Pattern.compile("\\b(?:\\d+(?:-\\d+)?|(?=[MCDLXVI]+\\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\\b", Pattern.CASE_INSENSITIVE);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        // Walk through every match; group 0 is the whole match.
        while (m.find()) {
            for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
Matches
$matches Array:
(
[0] => Array
(
[0] => 45-46
[1] => 25
[2] => 1-2
[3] => 4-5
[4] => 7
[5] => 9
[6] => 12
[7] => 13
[8] => 32
[9] => 35
[10] => V
[11] => VI
[12] => IX
)
)
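The sample text in this answer uses ASCII hyphens, while the output shown in the question contains en dashes in "1–2" and "4–5". With the pattern as written, such a range would come out as two separate numbers. A possible variant, assuming both characters should count as range separators, replaces - with the character class [-\u2013]. The class DashTolerantExtractor and the method extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the pattern above that also accepts an en dash.
class DashTolerantExtractor {
    // [-\u2013] matches either an ASCII hyphen or an en dash (U+2013).
    // The CASE_INSENSITIVE flag is omitted here: an assumption that only
    // uppercase Roman numerals are wanted.
    static final Pattern TOKEN = Pattern.compile(
        "\\b(?:\\d+(?:[-\\u2013]\\d+)?"
        + "|(?=[MCDLXVI]+\\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\\b");

    // Collects all non-empty matches in order of appearance.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "45-46 Nike, 1\u20132, 4\u20135, 7, V, VI, IX";
        for (String token : extract(text)) {
            System.out.println(token);
        }
    }
}
```

For the sample in main this yields seven tokens: 45-46, the two en-dash ranges, 7, V, VI and IX. The ranges keep whichever dash character appeared in the input.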