Question

Suppose we have some access logs like these:

83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Wanadoo 6.7; Orange 8.0)" "-"

65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276

151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 "http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"

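I need to extract the part of each line that follows POST or GET, i.e. the file path (for example /images/ht1.gif or /samples/dem/tt.php).
The log format may vary, but the request type and the path will always be present.

I tried the approach below, but because the log formats differ, it does not always work: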

import re

parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # ident %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

def get_structured_access_logs_list(access_logs):
    pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

    # Initialize required variables
    log_data = []

    # Get components from each line of the log file into a structured dict
    for line in access_logs:
        match = pattern.match(line)
        if match:  # skip lines that do not fit the expected format
            log_data.append(match.groupdict())
    return log_data

def parse_path(request_string) :
    rx = re.compile(r'^(?:GET|POST)\s+([^?\s]+).*$', re.M)
    return rx.findall(request_string)


def get_file_paths(access_logs_list):
    file_path_set = set()
    for entry in access_logs_list:
        if 'request' in entry:
            matches = parse_path(entry['request'])  # a single request string is passed, so at most one element is expected
            if matches:
                file_path_set.add(matches[0])
    return file_path_set

Update: after adjusting the code, the function get_file_paths returns a set containing the full paths of the files accessed in the access logs.

import os
import re

def parse_path(request_string):
    rx = re.compile(r'"(?:GET|POST)\s+([^\s?]*)', re.M)
    return rx.findall(request_string)


def get_file_paths(access_logs):
    # root_path (the site's document root) is assumed to be defined elsewhere
    file_set = set()
    for line in access_logs:
        matches = parse_path(line)  # a single line is passed, so the list is expected to contain one element
        if not matches:
            continue
        full_path = root_path + matches[0]
        if os.path.isfile(full_path):
            file_set.add(full_path)
    return file_set

Answer 1

You can use

(?x)^
    (?P<host>\S+)                         \s+         # host %h
    \S+                                   \s+         # ident %l (unused)
    (?P<user>\S+)                         \s+         # user %u
    \[(?P<time>.*?)\]                     \s+         # time %t
    "\S+\s+(?P<request>[^"?\s]*)[^"]*"    \s+         # request "%r"
    (?P<status>[0-9]+)                    \s+         # status %>s
    (?P<size>\S+)                      (?:\s+         # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"         \s+         # referrer "%{Referer}i"
    "(?P<agent>[^"]*)"                 (?:\s+         # user agent "%{User-agent}i"
    "[^"]*"                            )? )?          # unused
$

See the regex demo.

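I introduced a number of minor improvements (for example, [^"]* instead of .*). The main ones are the optional non-capturing groups, which let the pattern match lines where the referrer and agent fields are missing, and the request pattern, which now reads (?P<request>[^"?\s]*): it captures only zero or more characters other than whitespace, ? and ", while the [^"]* that follows matches the rest of the field.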
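To illustrate why [^"]* is the safer choice inside a quoted field, here is a small sketch. The sample line is a shortened version of the first log line from the question, and the two one-field patterns are simplified for illustration rather than taken from the full pattern above.

```python
import re

# Shortened sample: request, status, size, referrer and one extra quoted field.
line = '"GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "-"'

# Greedy .* runs to the LAST double quote on the line, swallowing later fields.
greedy = re.match(r'"(?P<request>.*)"', line)
print(greedy.group('request'))

# [^"]* cannot cross a double quote, so it stops at the end of the request field.
bounded = re.match(r'"(?P<request>[^"]*)"', line)
print(bounded.group('request'))
```

In a longer pattern the greedy version often still works because the regex engine backtracks, but the negated character class states the field boundary explicitly and avoids that extra work.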

Also, it makes sense to compile the pattern once rather than recompiling it for every line that is processed.

The (?x) modifier enables free-spacing mode, which lets you format the pattern across multiple lines and add comments.

Python demo:
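A minimal sketch of how the pattern above could be used from Python, assuming the log lines are available as a list of strings. The names LOG_PATTERN, extract_paths and sample_lines are illustrative and not part of the original answer; the sample lines are the first two from the question.

```python
import re

# Compiled once at import time; (?x) enables free-spacing mode with comments.
LOG_PATTERN = re.compile(r'''(?x)^
    (?P<host>\S+)                         \s+         # host %h
    \S+                                   \s+         # ident %l (unused)
    (?P<user>\S+)                         \s+         # user %u
    \[(?P<time>.*?)\]                     \s+         # time %t
    "\S+\s+(?P<request>[^"?\s]*)[^"]*"    \s+         # request "%r" (path only)
    (?P<status>[0-9]+)                    \s+         # status %>s
    (?P<size>\S+)                      (?:\s+         # size %b (can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"         \s+         # referrer
    "(?P<agent>[^"]*)"                 (?:\s+         # user agent
    "[^"]*"                            )? )?          # unused
$''')


def extract_paths(lines):
    """Return the set of request paths found in the given log lines."""
    paths = set()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:  # lines that do not fit the pattern are skipped
            paths.add(match.group('request'))
    return paths


sample_lines = [
    '83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 '
    '"http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; '
    'Wanadoo 6.7; Orange 8.0)" "-"',
    '65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276',
]

print(extract_paths(sample_lines))
```

The second sample line has no referrer or agent, so it only matches because of the optional groups; its referrer and agent groups come back as None.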


Answer 2

You can use this regex and get the path from group 1:

^.*?"(?:GET|POST) ([^\s?]+)

Demo
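A minimal sketch of applying this regex line by line in Python. The names PATH_RX and get_path are hypothetical helpers introduced here; the sample line is the second log line from the question.

```python
import re

# Lazily skip everything up to the quoted request, then capture the path
# (one or more characters that are neither whitespace nor '?').
PATH_RX = re.compile(r'^.*?"(?:GET|POST) ([^\s?]+)')


def get_path(line):
    """Return the request path from a log line, or None if there is no match."""
    match = PATH_RX.match(line)
    return match.group(1) if match else None


line = '65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276'
print(get_path(line))  # /samples/dem/tt.php
```

Returning None for non-matching lines lets the caller decide whether to skip or report them.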

Answer 3

Since your regex is very generic (you use \S and ., which are very broad), why not simply use:

"(?:GET|POST)\s+([^\s?]*)

[^\s?] matches any character that is neither whitespace nor a question mark.

See the demo here.
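Because this pattern is not anchored to the start of the line, it can also be run over a whole block of log text with findall. The following is a sketch under that assumption; REQUEST_RX and log_text are illustrative names, and the sample text is the third and fourth log lines from the question.

```python
import re

# Unanchored: finds the quoted request wherever it appears on the line.
REQUEST_RX = re.compile(r'"(?:GET|POST)\s+([^\s?]*)')

log_text = '\n'.join([
    '151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 '
    '"http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"',
    '10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 '
    '"http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"',
])

# findall returns the contents of the single capture group for every match.
paths = REQUEST_RX.findall(log_text)
print(paths)  # ['/css/main.css', '/right_arrow.jpg']
```

The trade-off of leaving the pattern unanchored is that it tolerates variations in the leading fields, but it would also match a referrer or agent string that happened to begin with "GET " or "POST ".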

Python parsing GET|POST path in access logs
