Find, decode, and replace all base64 values in a text file

Date: 2015-11-02 21:25:05

Tags: python sql regex sed base64

I have an SQL dump file that contains text with HTML links such as:

<a href="http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=">attached file</a>

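I want to find, decode, and replace the base64 part of the text in each of these links.

I have been trying to do this in Python with regular expressions and base64. However, my regex skills are not up to the task.

I need to select any string that starts with 'getattachment.php?data=' and ends with '"'.

I then need to decode the part between 'data=' and '&quot' using base64.b64decode().

The result should look something like: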

<a href="http://blahblah.org/kb/4/Topcon_data-download_howto.pdf">attached file</a>

I think the solution will look something like:

import re
import base64
with open('phpkb_articles.sql') as f:
    for line in f:
        # re.sub takes the string to search as its third argument
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64, line)

Any ideas?

EDIT: The answer, for anyone interested.

import re
import base64
import sys


def decode_base64(match):
    """
    Decode the base64 payload captured by the regex and build the new path
    """
    # fix escaped equal signs in some base64 strings
    base64_string = match.group(1).replace('%3D', '=')
    # b64decode returns bytes on Python 3, so decode to str before editing
    decoded_string = base64.b64decode(base64_string).decode('utf-8')

    # substitute '/' for '|'
    decoded_string = decoded_string.replace('|', '/')

    # escape the spaces in file names
    decoded_string = decoded_string.replace(' ', '%20')

    return 'assets/' + decoded_string + '&quot'


# number of lines written out unedited because decoding failed
count = 0

# the dot is escaped so that it only matches a literal '.'
pattern = r'getattachment\.php\?data=([^&]+?)&quot'

# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except ValueError:
            # binascii.Error (bad base64) and UnicodeDecodeError are both
            # ValueError subclasses; output the unedited line to prevent corruption
            sys.stdout.write(line)
            count += 1
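
As a quick check of the callback approach, here is a minimal sketch that runs the same kind of substitution on a single in-memory line instead of the dump file. It assumes Python 3 and that quotes are stored as &quot; in the dump (as the pattern implies). The helper name rewrite_link and the sample line are made up for illustration, and urllib.parse.unquote is swapped in for the manual '%3D' replacement.

```python
import base64
import re
from urllib.parse import unquote

# Same idea as the pattern above, with the dot escaped.
PATTERN = re.compile(r'getattachment\.php\?data=([^&]+?)&quot')


def rewrite_link(match):
    """Hypothetical helper: decode the captured base64 and build the new path."""
    payload = unquote(match.group(1))  # turns %3D back into '='
    decoded = base64.b64decode(payload).decode('utf-8')
    path = decoded.replace('|', '/').replace(' ', '%20')
    return 'assets/' + path + '&quot'


# Illustrative line; the '=' padding is percent-escaped on purpose.
sample = ('<a href=&quot;http://blahblah.org/kb/getattachment.php?data='
          'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY%3D&quot;>attached file</a>')

result = PATTERN.sub(rewrite_link, sample)
print(result)
```

Because re.sub calls the function once per match, a line with several links gets every one of them rewritten in a single pass.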

1 answer:

Answer 0: (score: 1)

You already have it; you just need the small pieces:

The pattern r'data=([^&]+?)&quot' will match everything after data= and before &quot:

>>> import re
>>> pat = r'data=([^&]+?)&quot'
>>> # quotes are written as &quot; here, which is what the pattern expects
>>> line = '<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>'
>>> decodeString = re.search(pat, line).group(1)  # the base64 string is captured by the group, so we only want group(1)
>>> decodeString
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='

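From there you can finish the job with the str.replace() method together with base64.b64decode(). I don't want to just write the code for you, but this should give you an idea of where to go.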
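
For readers who want to see those remaining steps, one possible sketch is below. It assumes Python 3 (where b64decode returns bytes) and a line whose quotes are stored as &quot;. The variable names encoded, decoded, new_path, and new_line are illustrative, not part of the original post.

```python
import base64
import re

pat = r'data=([^&]+?)&quot'
line = ('<a href=&quot;http://blahblah.org/kb/getattachment.php?data='
        'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>')

match = re.search(pat, line)
if match:
    encoded = match.group(1)
    # b64decode returns bytes, so decode to str before editing the text
    decoded = base64.b64decode(encoded).decode('utf-8')
    # the decoded value uses '|' where the final path needs '/'
    new_path = decoded.replace('|', '/')
    # swap the script name and its base64 argument for the decoded path
    new_line = line.replace('getattachment.php?data=' + encoded, new_path)
else:
    # no link on this line, so leave it untouched
    new_line = line

print(new_line)
```

Note that re.search only finds the first link on a line; if a line can hold several, looping over re.finditer or using re.sub with a function is probably the simpler route.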