Converting an Outlook PST to JSON using libpst

Time: 2016-06-30 11:40:06

Tags: python ruby json email pst

I have an Outlook PST file and I would like to get its emails as JSON, e.g. something like:

{"emails": [
{"from": "alice@example.com",
 "to": "bob@example.com",
 "bcc": "eve@example.com",
 "subject": "mitm",
 "content": "be careful!"
}, ...]}

I thought about using readpst to convert to MH format and then scanning the result in a ruby / python / bash script. Is there a better way?

Unfortunately, the ruby-msg gem can't handle my PST file (and it looks like it hasn't been updated since 2014).

1 Answer:

Answer 0 (score: 2)

I found a way to do it in two stages: first convert to mbox, then convert the mbox to JSON:

# requires installing libpst
pst2json my.pst
# or you can specify a custom output dir and an outlook mail folder,
# e.g. Inbox, Sent, etc.
pst2json -o email/ -f Inbox my.pst

pst2json is my own script; mbox2json is slightly modified from the one in Mining the Social Web.

pst2json

#!/usr/bin/env bash

usage(){
    echo "usage: $(basename "$0") [-o <output-dir>] [-f <folder>] <pst-file>"
    echo "default output-dir: email/mbox-all/<pst-file>"
    echo "default folder: Inbox"
    exit 1
}

# readpst ships with libpst; check for it quietly and report errors on stderr
command -v readpst >/dev/null || { echo "Error: libpst not installed" >&2; exit 1; }
folder=Inbox

while (( $# > 0 )); do
    [[ -n "$pst_file" ]] && usage
    case "$1" in
        -o)
            if [[ -n "$2" ]]; then
                out_dir="$2"
                shift 2
            else
                usage
            fi
            ;;
        -f)
            if [[ -n "$2" ]]; then
                folder="$2"
                shift 2
            else
                usage
            fi
            ;;
        *)
            pst_file="$1"
            shift
            ;;
    esac
done

# a PST file is mandatory; quote it so paths with spaces survive
[[ -n "$pst_file" ]] || usage
default_out_dir="email/mbox-all/$(basename "$pst_file")"
out_dir=${out_dir:-"$default_out_dir"}
mkdir -p "$out_dir"
readpst -o "$out_dir" "$pst_file"
[[ -f "$out_dir/$folder" ]] || { echo "Error: folder $folder is missing or empty."; exit 1; }
res="$out_dir"/"$folder".json
mbox2json "$out_dir/$folder" "$res" && echo "Success: result saved to $res"

mbox2json (python 2.7):

# -*- coding: utf-8 -*-

import sys
import mailbox
import email
import quopri
import json
# BeautifulSoup 3 (Python 2 only); with bs4 the import is `from bs4 import BeautifulSoup`
from BeautifulSoup import BeautifulSoup

MBOX = sys.argv[1]
OUT_FILE = sys.argv[2]
SKIP_HTML = True  # drop text/html parts and keep only the other (plain text) parts

def cleanContent(msg):

    # Decode message from "quoted printable" format

    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present

    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))


def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, CC, and Bcc fields, if present, could have multiple items
    # Note that not all of these fields are necessarily defined

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        # json_msg[k] is already unicode here, so it must not be decoded a second time
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r'
                , '').replace(' ', '').split(',')

    try:
        for part in msg.walk():
            json_part = {}
            if part.get_content_maintype() == 'multipart':
                continue
            content_type = part.get_content_type()
            if SKIP_HTML and content_type == 'text/html':
                continue
            json_part['contentType'] = content_type
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['content'] = cleanContent(content)

            json_msg['parts'].append(json_part)
    except Exception as e:
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    finally:
        return json_msg

# There's a lot of data to process, so use a generator to do it. See http://wiki.python.org/moin/Generators
# Using a generator requires a trivial custom encoder be passed to json for serialization of objects
class Encoder(json.JSONEncoder):
    def default(self, o):
        return {'emails': list(o)}


# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)

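# Open both files in a with-block so they are closed (and flushed) reliably
with open(MBOX, 'rb') as mbox_file, open(OUT_FILE, 'wb') as out_file:
    mbox = mailbox.UnixMailbox(mbox_file, email.message_from_file)
    json.dump(gen_json_msgs(mbox), out_file, indent=4, cls=Encoder)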
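
The script above only runs on Python 2.7: mailbox.UnixMailbox is gone in Python 3 and the BeautifulSoup 3 import is Python 2 only. Below is a rough sketch of how the same mbox-to-JSON step might look on Python 3 using only the standard library. It is not the original author's code. The function names (jsonify_message, mbox_to_json) are made up for illustration, and it assumes you only want text/* parts and are happy to let get_payload(decode=True) undo quoted-printable/base64 instead of calling quopri by hand. It also uses email.utils.getaddresses for To/Cc/Bcc, which should cope better with display names that contain commas than splitting on ','.

```python
import json
import mailbox
import sys
from email.utils import getaddresses


def jsonify_message(msg, skip_html=True):
    """Turn one message into a JSON-serializable dict (hypothetical helper)."""
    json_msg = {"parts": []}
    for key, value in msg.items():
        json_msg[key] = str(value)

    # To, Cc and Bcc may each hold several addresses; keep only the addr-spec.
    for key in ("To", "Cc", "Bcc"):
        values = msg.get_all(key)
        if values:
            pairs = getaddresses([str(v) for v in values])
            json_msg[key] = [addr for _name, addr in pairs if addr]

    for part in msg.walk():
        # Assumption: only text/* parts are wanted; attachments are skipped.
        if part.get_content_maintype() != "text":
            continue
        content_type = part.get_content_type()
        if skip_html and content_type == "text/html":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, "replace")
        except LookupError:  # charset name Python does not know
            text = payload.decode("utf-8", "replace")
        json_msg["parts"].append({"contentType": content_type, "content": text})
    return json_msg


def mbox_to_json(mbox_path, out_path, skip_html=True):
    """Read an mbox file, write {"emails": [...]} to out_path, return the count."""
    mbox = mailbox.mbox(mbox_path, create=False)
    try:
        emails = [jsonify_message(m, skip_html) for m in mbox]
    finally:
        mbox.close()
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump({"emails": emails}, out, indent=4)
    return len(emails)


if __name__ == "__main__" and len(sys.argv) == 3:
    mbox_to_json(sys.argv[1], sys.argv[2])
```

Unlike the generator version, this sketch builds the whole list in memory before dumping it, which keeps the code short but may not suit a very large mailbox.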

Now the file can be processed easily, e.g. to get only the content of the emails:

jq '.emails[] | .parts[] | .content' < out/Inbox.json
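
If jq is not at hand, roughly the same thing can be done in a few lines of Python. This is only a sketch: iter_contents mirrors the jq filter above, and flatten is a hypothetical helper that reduces one entry to the from/to/bcc/subject/content shape from the question. It assumes the header keys appear exactly as 'From', 'To', 'Bcc' and 'Subject', which real mail does not guarantee (header names are case-insensitive).

```python
import json


def load(path):
    """Load the JSON file written by mbox2json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def iter_contents(doc):
    """Yield the content of every part of every email, like the jq filter."""
    for email_obj in doc.get("emails", []):
        for part in email_obj.get("parts", []):
            yield part.get("content", "")


def flatten(email_obj):
    """Reduce one email to the flat shape sketched in the question."""
    def header(key):
        value = email_obj.get(key, "")
        # mbox2json stores To/Cc/Bcc as lists; join them for a flat view.
        return ", ".join(value) if isinstance(value, list) else value

    return {
        "from": header("From"),
        "to": header("To"),
        "bcc": header("Bcc"),
        "subject": header("Subject"),
        "content": "\n".join(iter_contents({"emails": [email_obj]})),
    }
```

For example, `[flatten(e) for e in load("out/Inbox.json")["emails"]]` would give a list in the shape the question asked for.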