Question

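I need to read a log file, extract all the paths, and return a sorted list of those paths with no duplicates. What is the best way to do this? Using a set?

I came up with something like this: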

def geturls(filename):
  f = open(filename)
  s = set() # creates an empty set?

  for line in f:
    # see if the line matches some regex

    if match:
      s.add(match.group(1))

  f.close()

  return sorted(s)

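Edit

The items put into the set are path strings, and the function should return them as an alphabetically sorted list.

Edit 2: Here is some sample data: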

10.254.254.28 - - [06/Aug/2007:00:12:20 -0700] "GET /keyser/22300/ HTTP/1.0" 302 528 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"
10.254.254.58 - - [06/Aug/2007:00:10:05 -0700] "GET /edu/languages/google-python-class/images/puzzle/a-baaa.jpg HTTP/1.0" 200 2309 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ; foo123@google.com,foo123@google.com,foo123@google.com,foo123@google.com)"
10.254.254.28 - - [06/Aug/2007:00:11:08 -0700] "GET /favicon.ico HTTP/1.0" 302 3404 "-" "googlebot-mscrawl-moma (enterprise; bar-XYZ;

The interesting part is the URL between GET and HTTP. Maybe I should mention that this is part of an exercise, not real-world data.
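For illustration, here is a minimal sketch of pulling out the part between GET and HTTP with a regular expression. The pattern and the names `REQUEST_RE` and `extract_path` are assumptions made for this example, not part of the exercise, and the sketch assumes each line contains a quoted request such as "GET /path HTTP/1.0".

```python
import re

# Assumed pattern: capture the non-space token between '"GET ' and ' HTTP'.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP')

def extract_path(line):
    """Return the requested path from one log line, or None if there is none."""
    match = REQUEST_RE.search(line)
    return match.group(1) if match else None

sample = ('10.254.254.28 - - [06/Aug/2007:00:11:08 -0700] '
          '"GET /favicon.ico HTTP/1.0" 302 3404')
print(extract_path(sample))  # /favicon.ico
```

Lines without a GET request simply yield None, so they can be filtered out before the paths go into the set.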

Answer 1

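def sorted_paths(filename):
    # matches(line) is not defined in this answer; it is assumed to return
    # a regex match object for the path, or None when the line has no path.
    with open(filename) as f:
        gen = (matches(line) for line in f)
        # The set drops duplicate paths as they are collected.
        s = set(match.group(1) for match in gen if match)
    return sorted(s)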

Answer 2

That is a good way to do it, both in terms of performance and in terms of conciseness.

Answer 3

Only if the order doesn't matter (since sets are unordered) and the items are hashable (which strings are).
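A small sketch of both conditions, using made-up paths: the set itself has no defined order, but `sorted()` builds an ordered list from it afterwards, and an unhashable value such as a list is rejected as a set member.

```python
paths = ["/keyser/22300/", "/favicon.ico", "/keyser/22300/"]

unique = set(paths)        # duplicates collapse; iteration order is unspecified
ordered = sorted(unique)   # sorted() returns a new list in ascending order
print(ordered)             # ['/favicon.ico', '/keyser/22300/']

# Lists are unhashable, so they cannot be stored in a set.
try:
    {["/favicon.ico"]}
except TypeError as exc:
    print("unhashable:", exc)
```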

Answer 4

You can use a dictionary to store the paths.

from collections import defaultdict

h = defaultdict(str)
uniq = []
for line in open("file"):
    if "pattern" in line:
        # code to extract path here.
        extractedpath = ......
        path = extractedpath.strip()  # strip once so both containers see the same value
        h[path] = ""  # using a dictionary to store unique values
        if path not in uniq:
            uniq.append(path)  # using a list to store unique values
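As a rough sketch of the dictionary idea with made-up paths: dictionary keys are unique, and since Python 3.7 a dict keeps insertion order, so `dict.fromkeys` removes duplicates while preserving the order in which paths were first seen. Note that the `not in uniq` test on a list scans the whole list each time, whereas a dict or set lookup does not.

```python
paths = ["/keyser/22300/", "/favicon.ico", "/keyser/22300/"]

# Keys are unique; values are irrelevant here (fromkeys sets them to None).
first_seen = list(dict.fromkeys(paths))
print(first_seen)          # ['/keyser/22300/', '/favicon.ico']

# Sort afterwards if an alphabetical list is wanted, as in the question.
print(sorted(first_seen))  # ['/favicon.ico', '/keyser/22300/']
```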

Answer 5

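Just make sure you have full path names everywhere. If you are on Windows, the same name can appear in different cases, because file names there are case-insensitive. Also, in Python you can use / instead of \ (and yes: be careful to escape backslashes).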
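One possible way to handle the Windows case issue, sketched with invented file names: normalize each path before adding it to the set. The example uses `ntpath` (the Windows flavour of `os.path`) explicitly so that it behaves the same on any platform; on Windows, `os.path` is the same module.

```python
import ntpath  # Windows implementation of os.path, importable everywhere

raw = ["C:/Logs/Access.log", r"c:\logs\ACCESS.LOG", r"C:\Logs\.\access.log"]

# normpath collapses "." segments and converts "/" to "\";
# normcase lowercases the result so spellings differing only in case compare equal.
normalized = {ntpath.normcase(ntpath.normpath(p)) for p in raw}
print(normalized)
```

All three spellings collapse to a single entry once normalized.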

If you are actually dealing with URLs, then most of the time domain.com, domain.com/, www.domain.com and http://www.domain.com all mean the same thing, and you should decide how to normalize them.
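What counts as "the same" URL is a policy decision. The following is only a sketch of one hypothetical policy: default to http when no scheme is given, lowercase the host, drop a leading "www.", and treat an empty path as "/". The function name `normalize_url` is made up for this example, and the sketch deliberately ignores ports, query strings and fragments.

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Apply one assumed normalization policy; not a general-purpose canonicalizer."""
    if "://" not in url:
        url = "http://" + url          # assumed default scheme
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]                # treat www.domain.com as domain.com
    path = parts.path or "/"           # empty path and "/" are equivalent
    return f"{parts.scheme}://{host}{path}"

variants = ["domain.com", "domain.com/", "www.domain.com", "http://www.domain.com"]
print({normalize_url(u) for u in variants})  # {'http://domain.com/'}
```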

Eliminating duplicates - should I use a set?
