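How to get email IDs from a company page using jsoup and regex

Time: 2016-10-26 06:28:34

Tags: regex email jsoup

I am trying to get the email IDs and links from the company page http://customercarecontacts.com/contact-infosys-phone-address-of-infosys-offices/

I can get the links successfully, but no emails are returned. I have tried several approaches and all of them failed. This is the code I am using: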

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupTest {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://customercarecontacts.com/contact-infosys-phone-address-of-infosys-offices/").userAgent("Mozilla/5.0").timeout(5000).get();

        Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
        Matcher matcher = p.matcher(doc.text());
        Set<String> emails = new HashSet<String>();
        while (matcher.find()) {
            emails.add(matcher.group());
        }

        Set<String> links = new HashSet<String>();

        Elements elements = doc.select("a[href]");
        for (Element e : elements) {
            links.add(e.attr("href"));
        }

        System.out.println("emails : " + emails);
        System.out.println("links : " + links);
    }
}

Can anyone suggest a way to get the emails?

1 Answer:

Answer 0: (score: 0)

You can try this:

[a-zA-Z0-9_.+-]+(@\\w+|\\s*\\(at\\)\\s*\\w+)\\.[a-zA-Z]+

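Explanation

The pattern is shown as it appears inside a Java string literal, so every backslash is doubled. [a-zA-Z0-9_.+-]+ matches the local part of the address. The group (@\\w+|\\s*\\(at\\)\\s*\\w+) matches either a literal @ followed by word characters, or the obfuscated form "(at)" with optional whitespace around it, followed by word characters. \\.[a-zA-Z]+ matches a dot and an alphabetic top-level domain. A pattern that requires a literal @ cannot match an address written as "askus (at) infosys.com", which is the form used in the sample input.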

Sample Java code

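import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexDemo {

    public static void main(String[] args) {
        // Matches plain addresses (user@host.tld) and obfuscated ones (user (at) host.tld).
        final String regex = "[a-zA-Z0-9_.+-]+(@\\w+|\\s*\\(at\\)\\s*\\w+)\\.[a-zA-Z]+";
        final String string = "df\n"
             + "askus (at) infosys.com (queries)<br />\n"
             + "asdfasdf\n"
             + "asdfasdf\n"
             + "asdf abc@yahoo.com asdfadsf\n"
             + "asdf pqr@google.com a sdfasfd\n\n\n";

        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);

        // Print each full match, followed by its capture groups.
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}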
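
With the pattern above, the full match for an obfuscated address still contains the " (at) " text. If you need usable addresses, one option is to capture the local part and the domain separately and join them with "@". The following is only a sketch under a few assumptions: the class name EmailNormalizer and the method extractEmails are made up for illustration, and the domain part has been widened (this is not part of the regex above) so that hosts with more than one dot are captured in full. In the original program you would pass doc.text() to it.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailNormalizer {

    // Group 1: local part. Then either "@" or "(at)" with optional whitespace.
    // Group 2: domain with one or more dot-separated labels (an assumption,
    // wider than the single-dot domain matched by the regex in the answer).
    private static final Pattern EMAIL = Pattern.compile(
            "([a-zA-Z0-9_.+-]+)(?:@|\\s*\\(at\\)\\s*)"
            + "([a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)*\\.[a-zA-Z]{2,})");

    // Hypothetical helper: returns the addresses found in text, rewritten as local@domain.
    public static Set<String> extractEmails(String text) {
        Set<String> emails = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            emails.add(m.group(1) + "@" + m.group(2));
        }
        return emails;
    }

    public static void main(String[] args) {
        String text = "askus (at) infosys.com (queries) abc@yahoo.com pqr@google.com";
        System.out.println(extractEmails(text));
    }
}
```

This handles only the "(at)" spelling. Other obfuscations such as "[at]" or "(dot)" would need additional alternatives in the pattern.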