Extracting text from specific tags in HTML with Mathematica

Time: 2015-07-13 21:34:08

Tags: html xml parsing tags wolfram-mathematica

For an HTML page structured like this:

<tr class="">
  <td class="number">1</td>
  <td class="name"><a href="..." >Jack Green</a></td>
  <td class="score-cell ">
    <span class="display">98
      <span class="tooltip column1"></span>
    </span>
  </td>
  <td class="score-cell ">
    ...
  </td>
  ...
<tr class="">
  <td class="number">2</td>
  <td class="name"><a href="..." target="_top">Nicole Smith</a></td>
  <td class="score-cell ">
    ...
  </td>

how can I extract only the text from the name tags, ending up with the list {Jack Green, Nicole Smith}? I am hoping for an elegant method.

1 answer:

Answer 0 (score: 2)

input =
  "          <tr class=\"\">
              <td class=\"number\">1</td>
              <td class=\"name\"><a href=\"...\" >Jack Green</a></td>
              <td class=\"score-cell \">
                <span class=\"display\">98
                  <span class=\"tooltip column1\"></span>
                </span>
              </td>
              <td class=\"score-cell \">
                ...
              </td>
            ...
            <tr class=\"\">
              <td class=\"number\">2</td>
              <td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>
              <td class=\"score-cell \">
               ...
              </td>";

(* Eliminate unnecessary whitespace and add a start character *)
html = StringJoin["X", StringReplace[StringTrim[input],
   {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];

(* Find the tags and positions of tags containing 'name' *)
tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];

(* Split on the tags and extract on the name tag positions *)
splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
Extract[splits, nametagpositions + 2]
  

{Jack Green, Nicole Smith}
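
As a possible alternative, not part of the original answer, the names can also be pulled out with a single StringCases call that anchors on the literal `<td class="name">` opening tag. This is only a sketch under stated assumptions: the variable `sample` is a shortened, hypothetical stand-in for the page, and the pattern assumes each name sits directly inside one `<a>` element with no nested tags and no whitespace between the `td` and `a` tags.

```mathematica
(* Hypothetical, shortened sample; the "..." href values are elided as in the question *)
sample = "<td class=\"number\">1</td><td class=\"name\"><a href=\"...\" >Jack Green</a></td>" <>
   "<td class=\"number\">2</td><td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>";

(* Match the name cell, skip the attributes of <a ...>, and capture the text up to </a> *)
names = StringCases[sample,
   "<td class=\"name\">" ~~ "<a" ~~ Except[">"] ... ~~ ">" ~~
     Shortest[name__] ~~ "</a>" :> name]
```

On this sample the call should return the same two names. It is less general than the tag-and-split approach, because it depends on the exact attribute text `class="name"`.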

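Note

The start character (the "X" prepended to html) is needed to guarantee a correct split. As the demonstration below shows, StringSplit does not report the empty splits between leading delimiter characters (here "a"); counting begins only once there is a substring to report. With a start character in place, the positions of the wanted items can be used reliably.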

html = "aa1aaa2aa";
splits = StringSplit[html, "a"]
  

{1, , , 2}

html = "aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]
  

{1, , , 2}

html = "0aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]
  

{0, , , , , , , 1, , , 2}
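
The behaviour shown here comes from StringSplit's default of dropping empty strings at the beginning and end of its result. A minimal sketch, not taken from the original answer: passing All as the third argument keeps those empty strings, which may serve as an alternative to prepending a start character.

```mathematica
html = "aa1aaa2aa";

(* Default: empty strings at the start and end of the result are dropped *)
StringSplit[html, "a"]

(* All: empty strings at both ends are kept, so element positions line up with delimiter positions *)
StringSplit[html, "a", All]
```

With All, the first element is the (possibly empty) text before the first delimiter, so the `+ 2` offset used in the answer should apply unchanged even without the leading "X".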