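I want to extract email addresses from several different websites. If they are active links, I can do this with
//A[starts-with(@href, 'mailto:')]
But some of them are just plain text, like example@domain.com, rather than links, so I want to select a path to the elements that contain @ inside.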
Answer 0 (score: 3)
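You may want to use a regular expression. Regular expressions let you extract email addresses regardless of where they appear in the document. Here is a small test-driven example to get you started: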
require "minitest/spec"
require "minitest/autorun"

module Extractor
  # Deliberately simple pattern: word characters, "@", word characters, ".", word characters.
  EMAIL_REGEX = /[\w]+@[\w]+\.[\w]+/

  # Returns an array of matches, or false when there are none.
  def self.emails(document)
    (matches = document.scan(EMAIL_REGEX)).any? ? matches : false
  end
end

describe "Extractor" do
  it 'should extract an email address from plaintext' do
    emails = Extractor.emails("email@example.com")
    _(emails).must_include "email@example.com"
  end

  it 'should extract multiple email addresses from plaintext' do
    emails = Extractor.emails("email@example.com and email2@example2.com")
    # must_include takes one value (a second argument is the failure message),
    # so each address gets its own assertion.
    _(emails).must_include "email@example.com"
    _(emails).must_include "email2@example2.com"
  end

  it 'should extract an email address from the href attribute of an anchor' do
    emails = Extractor.emails("<a href='mailto:email3@example3.com'>Email!</a>")
    _(emails).must_include "email3@example3.com"
  end

  it 'should extract multiple email addresses from both plaintext and within HTML' do
    emails = Extractor.emails("my@email.com OR <a href='mailto:email4@example4.com'>Email!</a>")
    _(emails).must_include "email4@example4.com"
    _(emails).must_include "my@email.com"
  end

  it 'should not extract an email address if there isn\'t one' do
    emails = Extractor.emails("email(at)address(dot)com")
    _(emails).must_equal false
  end

  # This one fails with EMAIL_REGEX as written: scan returns ["address@domain.co"].
  it "should extract email addresses" do
    emails = Extractor.emails("email.address@domain.co.uk")
    _(emails).must_include "email.address@domain.co.uk"
  end
end
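The last test fails because the regular expression doesn't account for many valid email addresses; here it stumbles on the dot in the local part and on the multi-part domain, matching only address@domain.co. See whether you can use this as a starting point, or find a better regular expression. To help build the regex, take a look at Rubular.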
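As one possible next step, here is a minimal sketch of a looser pattern that gets the failing case through. The module name LooseExtractor is made up for illustration, and the pattern is an assumption about what counts as "good enough" for scraping; it is nowhere near a full RFC 5322 validator. It allows ".", "+" and "-" in the local part and any number of dot-separated labels in the domain, and it returns an empty array instead of false when nothing matches.

```ruby
# Hypothetical module; a looser variant of the answer's Extractor.
module LooseExtractor
  # Local part: word characters plus ".", "+", "-".
  # Domain: a label followed by one or more ".label" groups.
  EMAIL_REGEX = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

  # Returns every match as an array (empty when nothing matches).
  def self.emails(document)
    document.scan(EMAIL_REGEX)
  end
end

p LooseExtractor.emails("email.address@domain.co.uk")
# => ["email.address@domain.co.uk"]
p LooseExtractor.emails("<a href='mailto:email3@example3.com'>Email!</a>")
# => ["email3@example3.com"]
p LooseExtractor.emails("email(at)address(dot)com")
# => []
```

Because the group is non-capturing, scan returns whole matches rather than arrays of captures. Returning an empty array keeps the return type consistent, so callers can use each or empty? without a false check.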
Answer 1 (score: 3)
"I want to select a path to the elements that contain @ inside"
Use:
//*[contains(., '@')]
It seems to me that what you really want is to select the elements that have a text-node child containing "@". If so, use:
//*[contains(text(), '@')]
XSLT-based verification:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <xsl:copy-of select="//*[contains(text(), '@')]"/>
  </xsl:template>
</xsl:stylesheet>
When this transformation is applied to the following XML document:
<html>
  <body>
    <a href="xxx.com">xxx.com</a>
    <span>someone@xxx.com</span>
  </body>
</html>
the XPath expression is evaluated and the selected nodes are copied to the output:
<span>someone@xxx.com</span>
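If you would rather check this from Ruby than through XSLT, something like the sketch below should show the difference between the two expressions. It assumes REXML (shipped with Ruby as a bundled gem), which is my choice for illustration and not something the question mentions; selected_names is a hypothetical helper. REXML is an XML parser, so it only works on well-formed markup like the sample document, not on arbitrary real-world HTML.

```ruby
require "rexml/document"

# The sample document from the answer above.
SAMPLE_HTML = <<~XML
  <html>
    <body>
      <a href="xxx.com">xxx.com</a>
      <span>someone@xxx.com</span>
    </body>
  </html>
XML

# Hypothetical helper: names of the elements an XPath expression selects.
def selected_names(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map(&:name)
end

# "." is the element's whole string value, so ancestors of the span match too.
p selected_names(SAMPLE_HTML, "//*[contains(., '@')]")

# text() restricts the test to the element's own text-node children.
p selected_names(SAMPLE_HTML, "//*[contains(text(), '@')]")
```

The first expression should pick up html and body as well as span, because their string values include the text of all descendants; the second should pick up only span. One caveat: in XPath 1.0, contains(text(), '@') looks only at the first text-node child, so an element like <p>Contact <b>us</b> at someone@xxx.com</p> would be missed. //*[text()[contains(., '@')]] tests every text-node child.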