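Finding text values with a regular expression

Time: 2019-05-26 13:17:28

Tags: python regex

I need to isolate a word in a log file and extract the value that follows it. I have been reading about regular expressions, but I can't get my head around the syntax.

I am reading from a log file and need to collect the values with something like re.findall.

I can do this in bash, but I can't convert it to Python.

Bash code: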

cat FILE | sed -n -e 's/^.*GET //p' | sed -e 's/,.*//g' |sort | uniq -c | sort -n

Log file excerpt:

109.40.2.10 - - [12/May/2019:06:53:40 +0200] "GET /ddo/livesearch?text=tilkn&format=json&app=android&size=30 HTTP/1.1" 200 96 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=tilk&format=json&app=android&size=30 HTTP/1.1" 200 464 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=ti&format=json&app=android&size=30 HTTP/1.1" 200 401 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=t&format=json&app=android&size=30 HTTP/1.1" 200 12 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:42 +0200] "GET /ddo/livesearch?text=&format=json&app=android&size=30 HTTP/1.1" 200 12 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:43 +0200] "GET /ddo/livesearch?text=b&format=json&app=android&size=30 HTTP/1.1" 200 12 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"

What I need to extract: the /ddo/* path from each line

1 answer:

Answer 0: (score: 1)

Use re.search with a lookbehind and a lookahead.

For example:

import re

s = '''109.40.2.10 - - [12/May/2019:06:53:40 +0200] "GET /ddo/livesearch?text=tilkn&format=json&app=android&size=30 HTTP/1.1" 200 96 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=tilk&format=json&app=android&size=30 HTTP/1.1" 200 464 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=ti&format=json&app=android&size=30 HTTP/1.1" 200 401 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=t&format=json&app=android&size=30 HTTP/1.1" 200 12 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:42 +0200] "GET /ddo/livesearch?text=&format=json&app=android&size=30 HTTP/1.1" 200 12 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
109.40.2.10 - - [12/May/2019:06:53:43 +0200] "GET /ddo/livesearch?text=b&format=json&app=android&size=30 HTTP/1.1" 200 12 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"'''

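for line in s.splitlines():
    # Lookbehind anchors just after '"GET ', lookahead stops before 'HTTP/1.1"';
    # the lazy group keeps the space that precedes HTTP, so each path ends with one.
    m = re.search(r'(?<="GET )(?P<path>.*?)(?=HTTP/1\.1")', line)
    if m:
        print(m.group("path"))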

Output:

/ddo/livesearch?text=tilkn&format=json&app=android&size=30 
/ddo/livesearch?text=tilk&format=json&app=android&size=30 
/ddo/livesearch?text=ti&format=json&app=android&size=30 
/ddo/livesearch?text=t&format=json&app=android&size=30 
/ddo/livesearch?text=&format=json&app=android&size=30 
/ddo/livesearch?text=b&format=json&app=android&size=30
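
The bash pipeline in the question also counts how often each value occurs (sort | uniq -c | sort -n), which the loop above does not do. A minimal sketch of that step, assuming a capture group with re.findall is acceptable in place of the lookarounds, and that only the request path is wanted rather than everything after GET. The names REQUEST_PATH, count_paths and sample are made up for illustration, and one log line is repeated in the sample purely so the counts differ:

```python
import re
from collections import Counter

# Hypothetical pattern: with a single capture group, findall returns only
# the captured path, so no lookbehind or lookahead is needed.
REQUEST_PATH = re.compile(r'"GET (\S+) HTTP/1\.1"')


def count_paths(lines):
    """Count how often each requested path appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(REQUEST_PATH.findall(line))
    return counts


# Lines taken from the excerpt above; the first one is duplicated for illustration.
sample = [
    '109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=tilk&format=json&app=android&size=30 HTTP/1.1" 200 464 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"',
    '109.40.2.10 - - [12/May/2019:06:53:41 +0200] "GET /ddo/livesearch?text=tilk&format=json&app=android&size=30 HTTP/1.1" 200 464 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"',
    '109.40.2.10 - - [12/May/2019:06:53:42 +0200] "GET /ddo/livesearch?text=&format=json&app=android&size=30 HTTP/1.1" 200 12 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"',
]

# Ascending by count, like the trailing `sort -n` in the bash version.
for path, n in sorted(count_paths(sample).items(), key=lambda item: item[1]):
    print(n, path)
```

Because count_paths accepts any iterable of lines, an open file object can be passed to it directly instead of a list, so the whole log does not have to be held in one string.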