我尝试从SMTP邮件中提取文本信息,即:
以下是输入示例:
sudo sed -i '/nameserver.*/d' /etc/resolv.conf
sudo tee -a /etc/resolv.conf >/dev/null <<'EOF'
nameserver 10.1.1.1
nameserver 10.1.1.2
nameserver 10.1.1.3
EOF
该示例被故意截断,因为标头更长,但这是用于示例
我尝试了sed和awk,但我设法做了一些事情,但没有做到我想要的。
SED:
Delivered-To: SOME@ADDRESS.COM
Received: by X.X.X.X with SMTP id SOMEID;
Wed, 9 Oct 2019 01:55:58 -0700 (PDT)
X-Received: by X.X.X.X with SMTP id SOMEID;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
Return-Path: <SOME@ADDRESS.COM>
Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X])
by SOME.THIRD.URL.COM with ESMTP id SOMEID
for <SOME@ADDRESS.COM>;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML
第一个是将URL留置一留置权,但此后我没有设法使用它。 而且无论如何,这是不正常的。
AWK:
sed -e 's/http/\nhttp/g' -n -e '/Received: from/{h;n;n;n;H;x;s/\n \+/;/;p}' a.txt
除URL之外,此方法有效。 rexgexp在单行中无效,是的。 我还尝试通过使用NR + 3的值来找到一种更优雅的约会方式,但这没用。
并以csv格式显示:
date; sender; URL1; URL2; ...
我更喜欢纯sed或纯awk,因为我认为我可以使用grep,tail,sed和awk来做到这一点,但是我想学习的时候,我更喜欢它们中的一个或两个:)
答案 0 :(得分:0)
好吧,以下冗长的sed脚本内部带有注释:
sed -nE '
/Received: from /{
# hold mu line!
h
# ach, spagetti code, here we go again
: notdate
${
s/.*/ERROR: INVALID INPUT: DATE NOT FOUND/
p
q1
}
# the next line after the line ending with ; should be the date
/;$/!{
# so search for a line ending with ;
n
b notdate
}
# the next line is the date
n
# remove leading spaces
s/^[[:space:]]*//
# grab the Received: from line
G
# and save it for later
h
}
# headers end with an empty line
/^$/{
# loop over lines
: read_next_line
n
# flag with \x01<URL>\x02 all occurences of URLs
s/"(http[^"]+)"/\x01\1\x02/g
# we found at least one URL if there is \x01 in the pattern space
/\x01/{
# extract each occurence to the end of pattern space with a newline
: again
s/^([^\x01]*)\x01([^\x02]*)\x02(.*)$/\1\3\n\2/
t again
# remove everything in front of separator - the unparsed part of line
s/^[^\n]*\n//
# add URLs to hold space
H
}
# if this is the last line, we should finally print something!, and, exit
${
# grab the hold space
x
# replace the separator for a ;
s/\n/;/g
# print and exit successfully
p
q 0
}
# here we go again!
b read_next_line
}
'
用于以下输入:
Delivered-To: SOME@ADDRESS.COM
Received: by X.X.X.X with SMTP id SOMEID;
Wed, 9 Oct 2019 01:55:58 -0700 (PDT)
X-Received: by X.X.X.X with SMTP id SOMEID;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
Return-Path: <SOME@ADDRESS.COM>
Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X])
by SOME.THIRD.URL.COM with ESMTP id SOMEID
for <SOME@ADDRESS.COM>;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML
SOMEHTML src="http://URL3" SOMEHTML src="http://URL4"
输出:
Wed, 09 Oct 2019 01:55:58 -0700 (PDT);Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X]);http://URL1;http://URL2;http://URL3;http://URL4