For an HTML page structured like this, how can I extract the text from only the name tags and end up with a list of the names?
<tr class="">
<td class="number">1</td>
<td class="name"><a href="..." >Jack Green</a></td>
<td class="score-cell ">
<span class="display">98
<span class="tooltip column1"></span>
</span>
</td>
<td class="score-cell ">
...
</td>
...
<tr class="">
<td class="number">2</td>
<td class="name"><a href="..." target="_top">Nicole Smith</a></td>
<td class="score-cell ">
...
</td>
I am hoping for an elegant method.
Answer 0 (score: 2)
input =
" <tr class=\"\">
<td class=\"number\">1</td>
<td class=\"name\"><a href=\"...\" >Jack Green</a></td>
<td class=\"score-cell \">
<span class=\"display\">98
<span class=\"tooltip column1\"></span>
</span>
</td>
<td class=\"score-cell \">
...
</td>
...
<tr class=\"\">
<td class=\"number\">2</td>
<td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>
<td class=\"score-cell \">
...
</td>";
(* Eliminate unnecessary whitespace and add a start character *)
html = StringJoin["X", StringReplace[StringTrim[input],
{"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];
(* Find the tags and positions of tags containing 'name' *)
tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];
(* Split on the tags and extract on the name tag positions *)
splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
Extract[splits, nametagpositions + 2]
{Jack Green, Nicole Smith}
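As a shorter alternative to the split-and-extract steps above, the names can also be matched directly with a single string pattern. This is only a sketch, not the answer's method: it assumes every name cell has exactly the form `<td class="name"><a ...>text</a>`, and `sample` is a hypothetical, shortened copy of the question's HTML.

```mathematica
(* Hypothetical sample mirroring the row structure shown in the question *)
sample = "<tr class=\"\">
<td class=\"number\">1</td>
<td class=\"name\"><a href=\"...\" >Jack Green</a></td>
<tr class=\"\">
<td class=\"number\">2</td>
<td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>";

(* Match the name cell, skip the attributes of the <a> tag,
   and capture everything up to the closing </a> *)
names = StringCases[sample,
  "<td class=\"name\">" ~~ "<a" ~~ Except[">"] ... ~~ ">" ~~
   (name : Except["<"] ..) ~~ "</a>" :> name]
```

Because the pattern is anchored on the literal `class="name"` cell, it does not depend on tag positions, but it would miss cells whose markup differs from the assumed form.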
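Note
The start character is needed to guarantee a correct split. As the demonstration below shows, the initial splits between "a" characters are not counted until there is a substring to report. Using a start character makes the positions of the wanted items reliable.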
html = "aa1aaa2aa";
splits = StringSplit[html, "a"]
{1, , , 2}
html = "aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]
{1, , , 2}
html = "0aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]
{0, , , , , , , 1, , , 2}
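The dropped leading pieces can also be kept by passing `All` as the third argument of `StringSplit`, which retains zero-length substrings at both ends. The sketch below assumes this could stand in for the start character; `row`, `tagPattern`, and `pos` are hypothetical names introduced only for illustration.

```mathematica
(* Hypothetical single-row sample; no start character is prepended *)
row = "<td class=\"name\"><a href=\"...\">Jack Green</a></td>";
tagPattern = "<" ~~ Except[">"] .. ~~ ">";

(* Locate the tags that mention 'name' *)
tags = StringCases[row, tagPattern];
pos = Position[StringMatchQ[tags, "*name*"], True];

(* All keeps the empty string before the first tag,
   so the same +2 offset still points at the link text *)
result = Extract[StringSplit[row, tagPattern, All], pos + 2]
```

With `All`, the piece before the first tag is preserved as an empty string, so it plays the same role as the "X" start character in the answer.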