因此,我有一个提取电子邮件和电话号码的程序。 我已经运行了,电话号码也很好。但是,电子邮件将继续导致: 例如:3465Usjohnson@astate.eduUP批准而不是sjohnson@astate.edu 从中提取的环绕文本: 870-972-3465Usjohnson@astate.eduUP批准和副总理博士Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu
在实际的PDF中,存在白色和间距,但是在复制和粘贴时,它们之间没有空格,因此也没有我收到的电子邮件。(它看起来像:enter image description here
#! python 3
import re, pyperclip
# Regex for phone numbers
phoneRegex = re.compile(r'''
# 860-555-3951, 555-3951, (860) 555-3951, 555-3951 ext 12345, ext. 12345, x12345
(
((\d\d\d)|(\(\d\d\d\)))? #area code (optional)
(\s|-) #first seperator
\d\d\d #first 3 digits
- #second seperator
\d\d\d\d #last 4 digits
(((ext(\.)?\s)|x) #Extension-words (optional)
(\d{2,5}))? #Extension - numbers (optional)
)
''', re.VERBOSE)
#Regex for Emails
emailRegex = re.compile(r'''
#some._+thing@(/d{2,5}))?.com
[a-zA-Z0-9_.+]+ #Name part
@ #@ symbol
[a-zA-Z0-9_.+]+ #domain
''', re.VERBOSE)
#pyperclip get text off
text = pyperclip.paste()
#extract
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)
allPhoneNumbers = []
for phoneNumber in extractedPhone:
allPhoneNumbers.append(phoneNumber[0])
#copy to clipboard
results = '\n'.join(allPhoneNumbers) + '\n'.join(extractedEmail)
pyperclip.copy(results)
答案 0 :(得分:0)
因此,因为我没有您的原始文本,所以我仅使用示例中的字符串。
查看以下两个正则表达式是否适合您。我还提供了第三点,这更加精确。
'(?<=\dU)[\w]+@[\w\.]+?(?=U|\s|$)'
。
'(?<=\dU)[\w]+@[\w]+\.[\w]+?(?=U|\s|$)'
。
示例测试
>>> import re
>>> string = '''3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu The surround text that it is being extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'''
>>> re.findall('(?<=\dU)[\w]+@[\w\.]+?(?=U|\s|$)', string)
#Output
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
>>> re.findall('(?<=\dU)[\w]+@[\w]+\.[\w]+?(?=U|\s|$)', string)
#Output
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
。
更加精确一点,因为所有电子邮件均以.edu
'(?<=\dU)[\w]+@[\w]*\.edu(?=U|\s|$)'
。
示例测试
>>> string = '''3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu The surround text that it is being extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'''
>>> re.findall('(?<=\dU)[\w]+@[\w]*\.edu(?=U|\s|$)', string)
#Output
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
答案 1 :(得分:0)
我本人是Python的新手。如果文字是专门从“ astate.edu ”网站中提取的,我认为您可以使用此正则表达式:
text='70-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'
import re
email= re.findall('[a-z]+\@\w+\.edu',text)
#output
['sjohnson@astate.edu', 'lcooksey@astate.edu']
祝你好运!