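Coursera course: Introduction to Data Science in Python, Assignment 1

Time: 2020-10-19 10:04:37

Tags: python regex

I am taking this course on Coursera and ran into some problems with the first assignment. The task is basically to use regular expressions to extract certain values from a given file. The function should then output a dictionary containing the following values: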

example_dict = {"host":"146.204.224.152",
                "user_name":"feest6811",
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

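Here is a short excerpt from the file. For some reason, the link does not work unless it is opened directly from Coursera. I apologize in advance for the poor formatting. One thing I should point out is that in some cases (as in the first example) there is no user name; a "-" is used instead.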

159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149 

This is what I have so far. However, the output is None. I think something is wrong with my pattern.

import re
def logs():
    
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    # YOUR CODE HERE
        
        pattern = """ 
        (?P<host>\w*)
        (\d+\.\d+.\d+.\d+\ )
        (?P<user_name>\w*)
        (\ -\ [a-z]+[0-9]+\ )
        (?P<time>\w*)
        (\[(.*?)\])
        (?P<request>\w*)
        (".*")
        """
        for item in re.finditer(pattern,logdata,re.VERBOSE):
       
            print(item.groupdict())

2 answers:

Answer 0 (score: 2)

You can use the following expression:

(?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . followed by 1+ digits
\s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
(?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
(?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
"(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "

See the regex demo and the Python demo:

import re
logdata = r"""159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149"""
pattern = r'''
(?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . followed by 1+ digits
\s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
(?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
(?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
"(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "
'''
for item in re.finditer(pattern,logdata,re.VERBOSE):
    print(item.groupdict())

Output:

{'host': '159.253.153.40', 'user_name': '-', 'time': '21/Jun/2019:15:46:10 -0700', 'request': 'POST /e-business HTTP/1.0'}
{'host': '136.195.158.6', 'user_name': 'feeney9464', 'time': '21/Jun/2019:15:46:11 -0700', 'request': 'HEAD /open-source/markets HTTP/2.0'}
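
Since the function in the question only prints and never returns, calling it would still yield None even with a working pattern. Below is a minimal sketch of one way to wrap the expression above in a function that returns the dictionaries. The names `LOG_PATTERN` and `parse_logs` are hypothetical, and the sample text is just the two lines quoted in the question rather than the real `assets/logdata.txt`.

```python
import re

# Pattern from the answer above, compiled once; re.VERBOSE allows the comments.
LOG_PATTERN = re.compile(r'''
(?P<host>\d+(?:\.\d+){3})  # host: four dot-separated runs of digits
\s+\S+\s+                  # skip the field between host and user name
(?P<user_name>\S+)\s+\[    # user name, or "-" when there is none
(?P<time>[^\]\[]*)\]\s+    # timestamp between [ and ]
"(?P<request>[^"]*)"       # request between double quotes
''', re.VERBOSE)


def parse_logs(text):
    """Hypothetical helper: return one dict per log line that matches."""
    return [match.groupdict() for match in LOG_PATTERN.finditer(text)]


# The two sample lines quoted in the question.
sample = (
    '159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845\n'
    '136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149\n'
)

records = parse_logs(sample)
for record in records:
    print(record)
```

Note also that in the question's pattern each named group is `\w*` followed by a separate unnamed group, so the names would not hold the intended fields even if the whole pattern matched. Here each named group wraps the sub-pattern it is meant to capture.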

Answer 1 (score: 1)

import re
def names():
    simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. Ruth and Peter, their parents, have 3 kids."""

    # YOUR CODE HERE
    p=re.findall('[A-Z][a-z]*',simple_string)
    return p

    #raise NotImplementedError()

Check it with the following code:

assert len(names()) == 4, "There are four names in the simple_string"
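
One caveat worth keeping in mind: `[A-Z][a-z]*` matches any capitalized word, so it passes this check only because the names happen to be the only capitalized words in `simple_string`. The short sketch below illustrates that; `other_text` is a made-up sentence used purely for contrast.

```python
import re

pattern = r'[A-Z][a-z]*'

# The string from the assignment: only the four names are capitalized.
names_text = ("Amy is 5 years old, and her sister Mary is 2 years old. "
              "Ruth and Peter, their parents, have 3 kids.")

# Made-up sentence: it starts with a capitalized word that is not a name.
other_text = "The girl Amy is 5 years old."

found_names = re.findall(pattern, names_text)
found_other = re.findall(pattern, other_text)

print(found_names)
print(found_other)
```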

For more on regular expressions, read the following documentation, which is very useful for beginners: https://docs.python.org/3/library/re.html#module-re