Extracting phone numbers from a complex string

Date: 2016-07-21 14:20:02

Tags: regex

Suppose our address book contains some unformatted data, for example:

+1 (4542) 114214 111@111.org d@ghhg.com ,,,,
+1 (2342) 114234 ert@nhy.sdfr.domain.org; 1@kjk.eiu.1
+7 (101) 111-222-11 abc@ert.com, def@sdf.org
+1 (102) 123532-2 some@mail.ru
+44 (301) 123 23 45 7zip@site.edu; ret@ghjj.org

I tried to write a regular expression for this:

/\+\d+\s\(\d+\)\s\d+[\d+\s|\d+-]+/g

But I don't know how to exclude the digits that come right before alphabetic characters. This may not even be a partial solution.

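Edit #1: I am overwhelmed by all the working solutions provided. Thank you all very much. If possible, I would appreciate it if you could add at least some references or an explanation of how to write a regex this complex.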

5 answers:

Answer 0 (score: 0)

This may be one of the few cases where you need a possessive quantifier.

My attempt:

\s*(\+?(\d+)\s*\(\d+\)\s+([- \d+]++(?!\@)|\d+))

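If the run of digits, spaces, and dashes is followed by "@", the [- \d+]++(?!\@) part fails to match; because the quantifier is possessive it cannot back off, so the \d+ alternative is used instead. As a result, the match does not include the email address.

The phone number is now stored in group 1.

Edit: Yes, the last input line is not matched correctly. It may be easier to remove the email addresses with the following regex, so that the phone number is what remains (along with a few commas, but those should not be a problem):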

\s[^\@ ]+\@[-\w.]+\.\w+
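
A minimal Python sketch of this email-stripping approach, assuming the standard re module and one address-book entry per line. The helper name strip_emails and the final trimming of leftover separators are illustrative additions, not part of the answer:

```python
import re

# Email pattern from the answer: one whitespace character, a local part
# without "@" or spaces, "@", then a dotted domain.
EMAIL = re.compile(r"\s[^\@ ]+\@[-\w.]+\.\w+")

def strip_emails(line):
    """Delete every email address; what remains is treated as the phone number."""
    remainder = EMAIL.sub("", line)
    # Assumption (not in the answer): leftover spaces, commas and semicolons
    # at either end are noise and can be trimmed.
    return remainder.strip(" ,;")

lines = [
    "+1 (4542) 114214 111@111.org d@ghhg.com ,,,,",
    "+1 (2342) 114234 ert@nhy.sdfr.domain.org; 1@kjk.eiu.1",
    "+7 (101) 111-222-11 abc@ert.com, def@sdf.org",
    "+1 (102) 123532-2 some@mail.ru",
    "+44 (301) 123 23 45 7zip@site.edu; ret@ghjj.org",
]

for line in lines:
    print(strip_emails(line))
```

Because the pattern requires an "@", the leading digit of 7zip@site.edu is removed together with the address rather than being glued onto the phone number.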

Answer 1 (score: 0)

Without knowing where you are using this regex, I would suggest a negative lookahead.

^[+\d() -]+(?![\w@])

Demo: https://regex101.com/r/rQ6fK4/1

If you want to capture the phone number, use:

^([+\d() -]+)(?![\w@])

It will be in $1 or \1 (depending on where you use it).
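
As an illustration, here is a small Python sketch of the capturing version, assuming the entries arrive as one multi-line string. The re.MULTILINE flag is an assumption needed so that ^ anchors at the start of every line; the variable names are hypothetical:

```python
import re

# Capturing pattern from the answer. The negative lookahead forces the greedy
# character class to back off until the next character is neither a word
# character nor "@".
PHONE = re.compile(r"^([+\d() -]+)(?![\w@])", re.MULTILINE)

text = "\n".join([
    "+1 (4542) 114214 111@111.org d@ghhg.com ,,,,",
    "+1 (2342) 114234 ert@nhy.sdfr.domain.org; 1@kjk.eiu.1",
    "+7 (101) 111-222-11 abc@ert.com, def@sdf.org",
    "+1 (102) 123532-2 some@mail.ru",
    "+44 (301) 123 23 45 7zip@site.edu; ret@ghjj.org",
])

# With a single capturing group, findall returns the group-1 text of each match.
phones = PHONE.findall(text)
for phone in phones:
    print(phone)
```

Since \w also covers digits, the lookahead rejects every position inside "111" or before "7zip", so the match ends at the last digit of the phone number.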

Answer 2 (score: 0)

You can use the following (demo):

(?<phone>\+\d{1,2}\s\(\d{3,4}\)\s(?:[\d- ]+\d)(?=\s)) 
\s+(?<email>.*?@.*?)(?=[\s;,]|$).*?
\s+(?<email2>[\w]*?@.*?)?(?=[\s;,]|$)

Which produces:

MATCH 1
phone   [4-20]  `+1 (4542) 114214`
email   [21-32] `111@111.org`
email2  [33-43] `d@ghhg.com`
MATCH 2
phone   [52-68] `+1 (2342) 114234`
email   [69-92] `ert@nhy.sdfr.domain.org`
email2  [94-105]    `1@kjk.eiu.1`
MATCH 3
phone   [110-129]   `+7 (101) 111-222-11`
email   [130-141]   `abc@ert.com`
email2  [143-154]   `def@sdf.org`
MATCH 4
phone   [159-176]   `+1 (102) 123532-2`
email   [177-189]   `some@mail.ru`
MATCH 5
phone   [194-213]   `+44 (301) 123 23 45`
email   [214-227]   `7zip@site.edu`
email2  [229-241]   `ret@ghjj.org`

Explanation:

(?<phone>\+\d{1,2}\s\(\d{3,4}\)\s(?:[\d- ]+\d)(?=\s)) Named capturing group phone

    \+ matches the character + literally
    \d{1,2} match a digit [0-9]
        Quantifier: {1,2} Between 1 and 2 times, as many times as possible, giving back as needed [greedy]
    \s match any white space character [\r\n\t\f ]
    \( matches the character ( literally
    \d{3,4} match a digit [0-9]
        Quantifier: {3,4} Between 3 and 4 times, as many times as possible, giving back as needed [greedy]
    \) matches the character ) literally
    \s match any white space character [\r\n\t\f ]
    (?:[\d- ]+\d) Non-capturing group
        [\d- ]+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            \d match a digit [0-9]
            - a single character in the list - literally
        \d match a digit [0-9]
    (?=\s) Positive Lookahead - Assert that the regex below can be matched
        \s match any white space character [\r\n\t\f ]

\s+ match any white space character [\r\n\t\f ]

    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]

(?<email>.*?@.*?) Named capturing group email

    .*? matches any character (except newline)
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
    @ matches the character @ literally
    .*? matches any character (except newline)
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]

(?=[\s;,]|$) Positive Lookahead - Assert that the regex below can be matched

    1st Alternative: [\s;,]
        [\s;,] match a single character present in the list below
            \s match any white space character [\r\n\t\f ]
            ;, a single character in the list ;, literally
    2nd Alternative: $
        $ assert position at end of a line

.*? matches any character (except newline)

    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]

\s+ match any white space character [\r\n\t\f ]

    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]

(?<email2>[\w]*?@.*?)? Named capturing group email2

    Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
    Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
    [\w]*? match a single character present in the list below
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
        \w match any word character [a-zA-Z0-9_]
    @ matches the character @ literally
    .*? matches any character (except newline)
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]

(?=[\s;,]|$) Positive Lookahead - Assert that the regex below can be matched

    1st Alternative: [\s;,]
        [\s;,] match a single character present in the list below
            \s match any white space character [\r\n\t\f ]
            ;, a single character in the list ;, literally
    2nd Alternative: $
        $ assert position at end of a line

g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
x modifier: extended. Spaces and text after a # in the pattern are ignored
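
For readers working in Python, the following is a hedged sketch of the phone group only. It assumes Python's re module, which spells named groups as (?P<name>...) and rejects [\d- ] as a bad character range, so the class is rewritten as [\d -]. The email groups are left out of this sketch:

```python
import re

# Python port of the "phone" group. Changes from the pattern above:
# (?<phone>...) becomes (?P<phone>...), and [\d- ] becomes [\d -].
PHONE = re.compile(r"(?P<phone>\+\d{1,2}\s\(\d{3,4}\)\s(?:[\d -]+\d)(?=\s))")

lines = [
    "+1 (4542) 114214 111@111.org d@ghhg.com ,,,,",
    "+1 (2342) 114234 ert@nhy.sdfr.domain.org; 1@kjk.eiu.1",
    "+7 (101) 111-222-11 abc@ert.com, def@sdf.org",
    "+1 (102) 123532-2 some@mail.ru",
    "+44 (301) 123 23 45 7zip@site.edu; ret@ghjj.org",
]

phones = []
for line in lines:
    match = PHONE.search(line)
    if match:
        # Named groups are read back by name.
        phones.append(match.group("phone"))

print(phones)
```

The trailing \d followed by (?=\s) is what keeps the match from ending on a space or from swallowing the digits of a following email address.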

Answer 3 (score: 0)

Here is my solution (on regex101):

\+\d+\s+\(\d+\)\s+[- \d]+(?= )

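It ensures that the final run of spaces, digits, and/or dashes ([- \d]+) is followed by a space ((?= )).

It cleanly captures all of the examples, with no trailing whitespace and without including any part of the email addresses.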
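
A short Python sketch that checks this claim against the sample data, assuming the standard re module; the variable names are illustrative:

```python
import re

# Pattern from the answer: the lookahead (?= ) makes the greedy class
# [- \d]+ back off until the character right after the match is a space.
PHONE = re.compile(r"\+\d+\s+\(\d+\)\s+[- \d]+(?= )")

text = "\n".join([
    "+1 (4542) 114214 111@111.org d@ghhg.com ,,,,",
    "+1 (2342) 114234 ert@nhy.sdfr.domain.org; 1@kjk.eiu.1",
    "+7 (101) 111-222-11 abc@ert.com, def@sdf.org",
    "+1 (102) 123532-2 some@mail.ru",
    "+44 (301) 123 23 45 7zip@site.edu; ret@ghjj.org",
])

# No capturing groups, so findall returns the full text of each match.
phones = PHONE.findall(text)

# Every match should already be free of leading and trailing whitespace.
clean = all(phone == phone.strip() for phone in phones)
print(phones, clean)
```

One limitation to keep in mind: an entry with nothing after the phone number has no following space, so this pattern would trim or miss that number.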

Answer 4 (score: 0)

I would not even try to parse the phone number.

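You have a phone number, separated by whitespace from one or more email addresses, which are themselves separated by commas or semicolons. An email address always contains an @.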

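Find the first @. If there is none, the phone number is the trimmed string. If there is an @, find the last space before it. The phone number is everything up to that space, trimmed. If there is no space before the @, you have no phone number.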

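Once the phone number is removed, you can find the emails by splitting the remaining string on "," or ";", trimming each piece, and discarding anything that does not contain an @.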

Then find a suitable library for handling phone numbers, if you need to do anything with the number beyond recording it.
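
The procedure described in this answer can be sketched in Python roughly as follows. The function name split_contact and the choice to return None when no phone number is found are assumptions made for illustration:

```python
import re

def split_contact(line):
    """Return (phone, emails) for one address-book line, without parsing the phone."""
    at = line.find("@")
    if at == -1:
        # No email at all: the whole trimmed line is the phone number.
        return line.strip(), []

    space = line.rfind(" ", 0, at)  # last space before the first "@"
    if space == -1:
        # No space before the "@": there is no phone number (hypothetical None).
        phone, rest = None, line
    else:
        phone, rest = line[:space].strip(), line[space:]

    # Split the remainder on "," or ";", trim, and keep only pieces with an "@".
    pieces = (piece.strip() for piece in re.split(r"[,;]", rest))
    emails = [piece for piece in pieces if "@" in piece]
    return phone, emails

print(split_contact("+44 (301) 123 23 45 7zip@site.edu; ret@ghjj.org"))
print(split_contact("+7 (101) 111-222-11 abc@ert.com, def@sdf.org"))
```

Note that the first sample line separates its two addresses with a space rather than a comma or semicolon, so this sketch returns them as a single piece; splitting on whitespace as well would be a further assumption beyond what the answer describes.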