
Question

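I have a very large .txt file containing hundreds of thousands of email addresses. They all follow this format:

...<name@domain.com>...

What is the best way to have Python loop through the entire .txt file looking for all instances of a certain @domain string, grab the entire address inside the <...>, and add it to a list? The trouble I have is the variable length of the different addresses.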

Answer 1

This code extracts an email address from a string. Use it while reading the file line by line.

>>> import re
>>> line = "should we use regex more often? let me know at  321dsasdsa@dasdsa.com.lol"
>>> match = re.search(r'[\w\.-]+@[\w\.-]+', line)
>>> match.group(0)
'321dsasdsa@dasdsa.com.lol'

If you have several email addresses, use findall:

>>> line = "should we use regex more often? let me know at  321dsasdsa@dasdsa.com.lol or dadaads@dsdds.com"
>>> match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
>>> match
['321dsasdsa@dasdsa.com.lol', 'dadaads@dsdds.com']

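The regex above will probably find the most common non-fake email addresses. If you want to comply fully with RFC 5322, check which email addresses follow the specification. Consult it to avoid errors when finding email addresses.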

Edit: as suggested in a comment by @kostek: in the string Contact us at support@example.com. my regex returns support@example.com. (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+

Edit II: another great improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ will capture example@do-main.com as well.
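
To make the difference between the original and the refined pattern concrete, here is a minimal sketch that runs both on the same input. The sample sentence and the variable names loose and strict are only illustrative.

```python
import re

# Illustrative input: one address followed by a sentence-ending dot,
# and one address with a hyphen in the domain.
text = "Contact us at support@example.com. or example@do-main.com"

# Original pattern: the domain part may swallow a trailing dot.
loose = re.findall(r'[\w\.-]+@[\w\.-]+', text)

# Refined pattern: the match must end in a dot followed by word characters.
strict = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', text)

print(loose)   # ['support@example.com.', 'example@do-main.com']
print(strict)  # ['support@example.com', 'example@do-main.com']
```

The refined pattern backtracks off the final dot because \.\w+ needs at least one word character after the last dot.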

Answer 2

You can also use the following to find all the email addresses in a text and print them as a list, or print each email on a separate line.

import re
line = "why people don't know what regex are? let me know asdfal2@als.com, Users1@gmail.de " \
       "Dariush@dasd-asasdsa.com.lo,Dariush.lastName@someDomain.com"
match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
for i in match:
    print(i)

If you want the addresses in a list, match already is one. This prints the list:

print(match)

Hope this helps.

Answer 3

If you are looking for a specific domain:

>>> import re
>>> text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
>>> match = re.findall(r'[\w.+%-]+@test\.com', text) # replace test\.com with the domain you're looking for, adding a backslash before periods; the hyphen goes last in the class because [\w-\.] is rejected as a bad character range in Python 3
>>> match
['la@test.com', 'test@test.com']
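
If the domain comes from a variable, the backslashes can be added automatically with re.escape. The following is a minimal sketch under that assumption. The helper name find_domain_emails and the trailing lookahead are additions for illustration and are not part of the answer above.

```python
import re

def find_domain_emails(text, domain):
    """Return all addresses in `text` whose domain is exactly `domain`."""
    # re.escape adds the backslash before each period in the domain.
    # The negative lookahead rejects longer domains such as
    # "test.community" or "test.com.au" when searching for "test.com".
    pattern = re.compile(r'[\w.+%-]+@' + re.escape(domain) + r'(?!\.?[\w-])')
    return pattern.findall(text)

text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
print(find_domain_emails(text, "test.com"))  # ['la@test.com', 'test@test.com']
```

Without the lookahead, an address such as a@test.community would be reported as a@test.com, which is usually not what you want when filtering by domain.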

Answer 4

import re
with open("file_name",'r') as f:
    s = f.read()
    result = re.findall(r'\S+@\S+',s)
    for r in result:
        print(r)
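
For a file with hundreds of thousands of addresses, reading line by line avoids holding the whole file in memory. Below is a minimal sketch that also uses the <...> wrapping described in the question. The names BRACKETED and extract_emails and the sample lines are hypothetical.

```python
import re

# Matches an address wrapped in angle brackets and captures only the address.
BRACKETED = re.compile(r'<([^<>\s@]+@[^<>\s@]+)>')

def extract_emails(lines, domain):
    """Collect bracketed addresses ending in @domain from an iterable of lines."""
    suffix = "@" + domain.lower()
    found = []
    for line in lines:
        for address in BRACKETED.findall(line):
            # Compare case-insensitively, since domains are not case-sensitive.
            if address.lower().endswith(suffix):
                found.append(address)
    return found

# A file object is itself an iterable of lines, so this also works as:
#
#     with open("file_name", "r") as f:
#         emails = extract_emails(f, "domain.com")

sample = ["...<name@domain.com>...", "...<other@elsewhere.org>...<x.y@domain.com>..."]
print(extract_emails(sample, "domain.com"))  # ['name@domain.com', 'x.y@domain.com']
```

Because the brackets delimit each address, the variable length of the addresses stops being a problem: the pattern simply takes everything between < and >.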

Answer 5

Here is another approach to this specific problem, using a regex from emailregex.com:
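
The code for this answer is not preserved here. A hedged sketch of one way such an approach could look: first pull out every <...> candidate, then keep only the candidates that pass a validation pattern. The pattern below is an assumption modeled on the kind of expression listed on emailregex.com, and the sample text and function name are illustrative.

```python
import re

# Assumed validation pattern (not necessarily the one the answer used).
EMAIL = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$')

def emails_in_brackets(text):
    # Step 1: capture the contents of every shortest whitespace-free <...> run.
    candidates = re.findall(r'<(\S+?)>', text)
    # Step 2: keep only the candidates that look like email addresses.
    return [c for c in candidates if EMAIL.match(c)]

text = "blabla <hello@world.com> <huhu@fake> bla bla <myname@some-domain.pt>"
print(emails_in_brackets(text))  # ['hello@world.com', 'myname@some-domain.pt']
```

Splitting the work into two steps keeps each regex simple: the first only knows about brackets, the second only about address syntax.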

Answer 6

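import re
# Sample input for illustration; replace it with the contents of your file.
text = "write to <name@domain.com> or <other@example.org> for details"
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
# findall returns one tuple of groups per match; the first group is the full address.
emails = [m[0] for m in matches]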

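Please don't hate me for having a go at this infamous regex. The regex works for email addresses like the ones shown below. I mostly used this as my basis for the valid characters in an email address.

Feel free to play around with it here.

I also made a variation where the regex captures emails like name at example.com: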

(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])

Answer 7

import re
txt = 'hello from absc@gmail.com to par1@yahoo.com about the meeting @2PM'
# Search txt (the original passed an undefined name) and use a raw string for the pattern.
email = re.findall(r'\S+@\S+', txt)
print(email)

Output:

['absc@gmail.com', 'par1@yahoo.com']

Answer 8

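import re
mess = '''Jawadahmed@gmail.com Ahmed@gmail.com
            abc@gmail'''
# Escape the dot so it matches a literal "." rather than any character.
email = re.compile(r'([\w\.-]+@gmail\.com)')
result = email.findall(mess)

# findall returns a list (never None), so test for emptiness instead.
if result:
    print(result)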

The code above returns only the Gmail addresses found in the text.

Answer 9

You can use \b at the end of the pattern to mark where the email ends, so that you get the correct address.

The regex:

[\w\.\-]+@[\w\-\.]+\b
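
A small sketch of what the trailing \b changes, using an illustrative sentence in which the address is followed by a full stop:

```python
import re

text = "Write to jane.doe@mail.example.org."

# Without \b the character class happily consumes the final dot.
without_boundary = re.findall(r'[\w\.\-]+@[\w\-\.]+', text)

# With \b the match has to end right after a word character.
with_boundary = re.findall(r'[\w\.\-]+@[\w\-\.]+\b', text)

print(without_boundary)  # ['jane.doe@mail.example.org.']
print(with_boundary)     # ['jane.doe@mail.example.org']
```

The word boundary forces the regex to backtrack past any trailing dots or hyphens, since those are not word characters.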

Answer 10

Example: if the mail ID in the string contains only lowercase a-z, _ or digits 0-9, the regex is as follows:

>>> import re
>>> str1 = "abcdef_12345@gmail.com"
>>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345@gmail.com'

Answer 11

import re

content = ' abcdabcd jcopelan@nyx.cs.du.edu  afgh 65882@mimsy.umd.edu  qwertyuiop mangoe@cs.umd'

# Despite the variable name, findall returns the matched strings, not match objects.
match_objects = re.findall(r'\w+@\w+[\.\w+]+', content)

Answer 12

#    \b[\w|\.]+   ---> means begins with any english and number character or dot.

import re

marks = '''

!()[]{};?#$%:'"\,/^&é*

'''

text = 'Hello from priyankv@gmail.com to python@gmail.com, datascience@@gmail.com and machinelearning@@yahoo..com wrong email address: farzad@google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|\.]*@[\w|\.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
  for x in marks:
    p = p.replace(x, "") 
  if len(re.findall(pattern, p)) > 0:
    print(re.findall(pattern, p))

Extract email sub-strings from a large document

12 answers
