Question

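I'm using import.io to extract information from a website, but I'm stuck on the email field. I managed to extract the other information, but this one is a bit confusing to me.

This is the markup on the site that I need to extract from. The site has several entries with this kind of code, and so several email addresses.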

<td valign="top"><table width="100%" cellspacing="0" cellpadding="3" border="0" class="text_black-11">
<tbody>
  <tr>
    <td width="35" align="center" class="text_02-11"><img width="16" height="16" src="/interface/icon_www.png"></td>
    <td class="text_02-11"><a target="" href="http://www.website.com" class="text_02-11">Visit Website</a></td>
  </tr>
  <tr>
    <td width="35" align="center" class="text_02-11"><img width="19" height="12" src="/interface/icon_email.png"></td>
    <td class="text_02-11"><a target="" href="mailto:info@mail.com" class="text_02-11">Send Email</a></td>
  </tr>
</tbody>

Answer 1

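If you can't target the email address directly, note that the href attribute of the a tag always contains mailto:, so you could try this

//a[contains (@href, 'mailto:')]/@href

or

//a[contains (., 'Send Email')]/@href

if the website is always built that way.

If you want to clean up that field after selecting it, you can add this regex to the regex field in import.io

(?<=mailto:).*

Make sure the XPath runs first.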
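
If you want to check the two steps outside import.io, here is a minimal sketch in Python. It is not import.io's own mechanism: it uses the standard-library html.parser as a stand-in for the XPath (selecting the href of every a tag whose href contains mailto:) and then applies the same lookbehind regex to strip the prefix. The names SAMPLE_HTML, MailtoCollector and extract_emails are made up for this illustration, and the sample markup is a trimmed copy of the snippet from the question.

```python
import re
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (assumption: real pages look similar).
SAMPLE_HTML = """
<table>
  <tr>
    <td><a target="" href="http://www.website.com" class="text_02-11">Visit Website</a></td>
  </tr>
  <tr>
    <td><a target="" href="mailto:info@mail.com" class="text_02-11">Send Email</a></td>
  </tr>
</table>
"""


class MailtoCollector(HTMLParser):
    """Rough equivalent of //a[contains(@href, 'mailto:')]/@href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "mailto:" in href:
            self.hrefs.append(href)


def extract_emails(html):
    # Step 1: select the href values (the "XPath" step).
    collector = MailtoCollector()
    collector.feed(html)
    # Step 2: clean each value with the same regex as in the answer.
    cleanup = re.compile(r"(?<=mailto:).*")
    emails = []
    for href in collector.hrefs:
        match = cleanup.search(href)
        if match:
            emails.append(match.group(0))
    return emails


emails = extract_emails(SAMPLE_HTML)
print(emails)  # ['info@mail.com']
```

The order matters for the same reason as in import.io: the regex only makes sense on the href value that the selection step has already isolated.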

I need an XPath that extracts this kind of code from the website.
