Question

For an html page structured like this:

<tr class="">
  <td class="number">1</td>
  <td class="name"><a href="..." >Jack Green</a></td>
  <td class="score-cell ">
    <span class="display">98
      <span class="tooltip column1"></span>
    </span>
  </td>
  <td class="score-cell ">
    ...
  </td>
...
<tr class="">
  <td class="number">2</td>
  <td class="name"><a href="..." target="_top">Nicole Smith</a></td>
  <td class="score-cell ">
    ...
  </td>

how can I extract only the text from the name tags, ending up with the list {Jack Green, Nicole Smith}? I am hoping for something elegant.

Answer 1

input =
  "          <tr class=\"\">
              <td class=\"number\">1</td>
              <td class=\"name\"><a href=\"...\" >Jack Green</a></td>
              <td class=\"score-cell \">
                <span class=\"display\">98
                  <span class=\"tooltip column1\"></span>
                </span>
              </td>
              <td class=\"score-cell \">
                ...
              </td>
            ...
            <tr class=\"\">
              <td class=\"number\">2</td>
              <td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>
              <td class=\"score-cell \">
               ...
              </td>";

(* Eliminate unnecessary whitespace and add a start character *)
html = StringJoin["X", StringReplace[StringTrim[input],
   {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];

(* Find the tags and positions of tags containing 'name' *)
tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];

(* Split on the tags and extract on the name tag positions *)
splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
Extract[splits, nametagpositions + 2]

{Jack Green, Nicole Smith}
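For comparison, if the name cells always have the shape <td class="name"><a ...>text</a></td>, as they do in the sample, a single StringCases call may be enough. This is only a minimal sketch and not the method used above. The `sample` string and the `names` variable are made up for illustration, and the pattern assumes the attribute is written exactly as class="name".

```mathematica
(* Hypothetical two-row sample in the same shape as the question's html *)
sample = "<tr class=\"\"><td class=\"number\">1</td>" <>
   "<td class=\"name\"><a href=\"...\" >Jack Green</a></td>" <>
   "<tr class=\"\"><td class=\"number\">2</td>" <>
   "<td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>";

(* Match the name cell's opening tags, skip the attributes of <a>,
   then capture the shortest run of text up to the closing </a> *)
names = StringCases[sample,
   "<td class=\"name\"><a" ~~ Except[">"] ... ~~ ">" ~~
     Shortest[name__] ~~ "</a>" :> name]
(* {"Jack Green", "Nicole Smith"} *)
```

This avoids the position bookkeeping, at the cost of being tied to that exact cell layout.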

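Note

The start character is needed to guarantee correct splitting. As the demonstration below shows, the initial splits between "a" characters are not counted until there is a substring to report. Adding a start character means the positions of the desired items can be used reliably.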

html = "aa1aaa2aa"; splits = StringSplit[html, "a"]

{1, , , 2}

html = "aaaaaaa1aaa2aaaaaaa"; splits = StringSplit[html, "a"]

{1, , , 2}

html = "0aaaaaaa1aaa2aaaaaaa"; splits = StringSplit[html, "a"]

{0, , , , , , , 1, , , 2}
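The same effect can be seen on a tag pattern, which may make the + 2 offset in the answer easier to follow. The sketch below uses a made-up one-row `fragment` (the href value is a placeholder) and hypothetical names `tagPattern` and `pos`. With the start character, piece k + 1 is the text following tag k, so the text after the <a> tag that follows the name tag sits two places past the name tag's position.

```mathematica
tagPattern = "<" ~~ Except[">"] .. ~~ ">";
fragment = "<tr><td class=\"name\"><a href=\"#\">Jack Green</a></td></tr>";

(* Without a start character the leading empty pieces are dropped,
   so the pieces no longer line up with the tag positions *)
StringSplit[fragment, tagPattern]
(* {"Jack Green"} *)

(* With a start character, piece k + 1 is the text that follows tag k *)
splits = StringSplit["X" <> fragment, tagPattern]
(* {"X", "", "", "Jack Green"} *)

(* The name tag is the second tag in the fragment *)
tags = StringCases[fragment, tagPattern];
pos = Position[StringMatchQ[tags, "*name*"], True]
(* {{2}} *)

(* + 1 for the start-character slot, + 1 to step past the <a> tag *)
Extract[splits, pos + 2]
(* {"Jack Green"} *)
```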

Extracting text from specific tags in html using Mathematica