Extracting email addresses from arbitrary files

Time: 2013-04-30 16:55:02

Tags: regex sed awk

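What is the best way to pull user@host.com combinations out of a large set of files?

I assume sed/awk can do this, but I'm not very familiar with regexps.

We have a file, Staff_data.txt, that contains more than just email addresses. We want to strip out the rest of the data and collect only the email addresses (e.g. h@south.com).

I think the easiest way is sed/awk in a terminal, but seeing how complex regexps can get, I'd appreciate some guidance.

Thanks.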

2 answers:

Answer 0: (score: 0)

You want grep here, not sed or awk. For example, to show all email addresses from the domain south.com:

grep -o '[^ ]*@south\.com' file
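
If you need every address rather than a single domain, the same `grep -o` idea can be widened with an extended regex. A minimal sketch, assuming a grep that supports `-o` (GNU and BSD grep do; it is not in POSIX). The sample data and the `staff_file` and `emails` names are made up for illustration, and the pattern is a deliberate simplification rather than a full RFC 5322 matcher:

```shell
#!/bin/sh
# Hypothetical sample data standing in for Staff_data.txt.
staff_file=$(mktemp)
cat > "$staff_file" <<'EOF'
John Smith, 555-0100, h@south.com, Accounts
Jane Doe <jane.doe@north.example.org>; backup: JD@South.com
no address on this line
EOF

# -o prints only the matched text, -E enables extended regular expressions.
# Simplified pattern: local part, "@", dotted domain ending in 2+ letters.
# LC_ALL=C makes the sort order byte-wise and therefore predictable.
emails=$(grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' "$staff_file" |
         LC_ALL=C sort -u)

printf '%s\n' "$emails"
rm -f "$staff_file"
```

On the real data, point the same command at Staff_data.txt instead of the temporary file.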

Answer 1: (score: 0)

Here is a somewhat embarrassing but apparently working script I wrote a few years ago to do this job:

# Get rid of any Message-Id line like this:
#   Message-ID: <AANLkTinSDG_dySv_oy_7jWBD=QWiHUMpSEFtE-cxP6Y1@mail.gmail.com>
#
# Change any character that can't be in an email address to a space.
#
# Print just the character strings that look like email addresses.
#
# Drop anything with multiple "@"s and change any domain names (i.e.
# the part after the "@") to all lower case as those are not case-sensitive.
#
# If we have a local mail box part (i.e. the part before the "@") that's
# a mix of upper/lower and another that's all lower, keep them both. Ditto
# for multiple versions of mixed case since we don't know which is correct.
#
# Sort uniquely.

cat "$@" |
awk '!/^Message-ID:/' |
awk '{gsub(/[^-_.@[:alnum:]]+/," ")}1' |
awk '{for (i=1;i<=NF;i++) if ($i ~ /.+@.+[.][[:alpha:]]+$/) print $i}' |
awk '
  BEGIN   { FS=OFS="@" }
  NF != 2 { printf "Badly formatted %s skipped.\n",$0 | "cat>&2"; next }
  { $2=tolower($2); print }
' |
sort -u

It's not pretty, but it seems robust.
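
For comparison, the stages described in the comments (skip Message-ID lines, blank out characters that cannot appear in an address, keep address-like tokens, reject tokens with more than one "@", lower-case only the domain, sort uniquely) can be folded into one awk program. This is a sketch under assumptions, not the author's script: the sample input and the `sample` and `emails` names are hypothetical, and plain ASCII ranges stand in for `[:alnum:]` and `[:alpha:]` so that it also runs on older awks lacking POSIX character classes.

```shell
#!/bin/sh
# Hypothetical input; the Message-ID line should be ignored entirely.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Message-ID: <AANLkTinSDG@mail.gmail.com>
From: "H. South" <H.South@SOUTH.com>, cc: h.south@south.com
Reply to bob@example.org or a@b@c.com
EOF

emails=$(awk '
  /^Message-ID:/ { next }                     # skip Message-ID headers
  {
    gsub(/[^-_.@A-Za-z0-9]+/, " ")            # blank out non-address characters
    for (i = 1; i <= NF; i++) {
      if ($i !~ /.+@.+[.][A-Za-z]+$/) continue
      n = split($i, part, "@")
      if (n != 2) {                           # more than one "@" in the token
        printf "Badly formatted %s skipped.\n", $i | "cat >&2"
        continue
      }
      print part[1] "@" tolower(part[2])      # lower-case the domain only
    }
  }
' "$sample" | LC_ALL=C sort -u)

printf '%s\n' "$emails"
rm -f "$sample"
```

Because only the domain is lower-cased, `H.South@south.com` and `h.south@south.com` both survive the final `sort -u`, while `a@b@c.com` is reported on stderr and dropped.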