Question

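I'm a novice trying to analyze my company's log files using Python. They have a different format, so online log analyzers don't work well on them.

The format is as follows: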

localtime time-taken x-cs-dns c-ip sc-status s-action sc-bytes
cs-bytes cs-method cs-uri-scheme cs-host cs-uri-port cs-uri-path
cs-uri-query cs-username cs-auth-group s-hierarchy s-supplier-name
rs(Content-Type) cs(Referer) cs(User-Agent) sc-filter-result
cs-categories x-virus-id s-ip

Example:

"[27/Feb/2012:06:00:01 +0900]" 65 10.184.17.23 10.184.17.23 200
TCP_NC_MISS 99964 255 GET http://thumbnail.image.example.com 80
/mall/shop/cabinets/duelmaster/image01.jpg - - -
DIRECT thumbnail.image.example.com image/jpeg - "Wget/1.12
(linux-gnu)" OBSERVED "RC_White_list;KC_White_list;Shopping" -
10.0.201.17

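The main thing I want to do right now is grab all the cs-host and cs-uri-path fields, concatenate them (to get http://thumbnail.image.example.com/mall/shop/cabinets/duelmaster/image01.jpg in the example above), count the unique instances, and rank and print them by number of accesses to see the top URLs. Is there a way to make Python treat whitespace as a separator between objects/columns and grab, for example, the 11th object?

Another complication is that our daily log files are huge (~15GB), and ideally I'd like this to take under 20 minutes, if possible.

Niklas B.'s code works nicely, and I can print the top IPs, users, etc.

Unfortunately, I can't get the program to print or write the results to an external file or email. Currently my code looks like this, and only the last line gets written to the file. What might be the problem?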

for ip, count in heapq.nlargest(k, sourceip.iteritems(), key=itemgetter(1)):
    top = "%d %s" % (count, ip)
    v = open("C:/Users/guest/Desktop/Log analysis/urls.txt", "w")
    print >>v, top

Answer 1

Yes:

from collections import defaultdict
from operator import itemgetter

access = defaultdict(int)

with open("/path/to/file.log", "rb") as f:
  for line in f:
    parts = line.split() # split at whitespace
    access[parts[11] + parts[13]] += 1 # adapt indices here

# print all URLs in descending order
for url, count in sorted(access.iteritems(), key=lambda (_, c): -c):
  print "%d %s" % (count, url)

# if you only want to see the top k entries:
import heapq
k = 10
for url, count in heapq.nlargest(k, access.iteritems(), key=itemgetter(1)):
  print "%d %s" % (count, url)

Untested. Another possibility is to use a Counter:

from collections import Counter
with open("/path/to/file.log", "rb") as f:
  counter = Counter(''.join(line.split()[11:14:2]) for line in f)

# print top 10 (leave argument out to list all)
for url, count in counter.most_common(10):
  print "%d %s" % (count, url)
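
The snippets above are Python 2 (print statement, iteritems). As a rough Python 3 sketch of the same counting idea, the block below runs on the sample record from the question. The helper name count_urls and its host_idx/path_idx parameters are made up for illustration. Note that the quoted timestamp contains a space, so split() turns it into two tokens, and in the sample as printed the scheme and host form a single token, which puts the host at index 10 and the path at index 12. The indices for the real files may well differ.

```python
from collections import Counter

# The sample record from the question, joined onto one line.
SAMPLE = (
    '"[27/Feb/2012:06:00:01 +0900]" 65 10.184.17.23 10.184.17.23 200 '
    'TCP_NC_MISS 99964 255 GET http://thumbnail.image.example.com 80 '
    '/mall/shop/cabinets/duelmaster/image01.jpg - - - '
    'DIRECT thumbnail.image.example.com image/jpeg - "Wget/1.12 '
    '(linux-gnu)" OBSERVED "RC_White_list;KC_White_list;Shopping" - '
    '10.0.201.17'
)


def count_urls(lines, host_idx, path_idx):
    """Count host+path combinations over an iterable of log lines.

    Hypothetical helper; the indices are assumptions to adapt per log format.
    """
    counter = Counter()
    for line in lines:
        parts = line.split()  # split at any run of whitespace
        # Skip blank or truncated lines instead of raising IndexError.
        if len(parts) <= max(host_idx, path_idx):
            continue
        counter[parts[host_idx] + parts[path_idx]] += 1
    return counter


# Two copies of the sample plus an empty line, standing in for a real file.
counts = count_urls([SAMPLE, SAMPLE, ""], host_idx=10, path_idx=12)
for url, n in counts.most_common(10):
    print("%d %s" % (n, url))
```

For a real log, passing the open file object as lines keeps the processing line by line, so the 15GB file is never loaded into memory at once.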

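By the way, the problem with your code for writing the URLs to a file is that you reopen the file in every iteration, which discards the file's contents each time. You should open the file outside the loop and only write inside it.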
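
A minimal Python 3 sketch of that fix might look like the following. The sourceip dict here is made-up data standing in for the one built in the question, and out_path is a placeholder in the temp directory and not the Desktop path from the question.

```python
import heapq
import os
import tempfile
from operator import itemgetter

# Hypothetical counts standing in for the `sourceip` dict from the question.
sourceip = {"10.184.17.23": 42, "10.0.201.17": 7, "10.0.0.5": 19}
k = 2

# Placeholder output path; substitute the real destination file.
out_path = os.path.join(tempfile.gettempdir(), "urls.txt")

# Open once, before the loop: mode "w" truncates the file only at this point.
with open(out_path, "w") as out:
    for ip, count in heapq.nlargest(k, sourceip.items(), key=itemgetter(1)):
        # Each iteration appends one line to the already-open file.
        out.write("%d %s\n" % (count, ip))
```

Because the file is opened with "w" exactly once, every line written inside the loop survives, and the with block closes the file when the loop finishes.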
