Question

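This may be a simple question, but I have wasted a lot of time without figuring out what is going on. I want to classify the HTTP requests in a web log file based on the extension of the requested resource. Here is my attempt.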

imgstr = ['.png','.gif','.jpeg','.jpg']     
docstr = [ '.pdf','.ppt','.doc' ]  
webstr = ['.html','.htm', '.asp', '.jsp', '.php', '.cgi', '.js','.css']
compressed = ['zip', 'rar', 'gzip', 'tar', 'gz', '7z']



def rtype(b):
    if any(x in b for x in imgstr):
        return 'A'
    elif any(x in b for x in docstr):
        return 'B'
    elif 'favicon.ico' in b:
        return 'C'
    elif 'robots.txt' in b:
        return 'D'
    elif 'GET / HTTP/1.1' in b:
        return 'E'
    elif any(x in b for x in webstr):
        return 'F'
    elif any(x in b for x in compressed):
        return 'G'
    else:
        return 'H'

df2['result'] = df2.Request.apply(rtype)

But df2['result'] contains only 'A'? The dtype of df2.Request is object. I tried changing it with df2['Referer'] = df2['Referer'].astype(str), but the dtype is still object. Here are the first 10 values of df2.Request.

0,GET /index.php?lang=ta HTTP/1.1
1,GET /index.php?limitstart=25&lang=en HTTP/1.1
2,GET /index.php/ta/component/content/article/43 HTTP/1.1
3,GET /index.php/ta/component/content/article/39-test HTTP/1.1
4,GET /robots.txt HTTP/1.1
5,GET /robots.txt HTTP/1.1
6,GET /index.php/en/computer-security-feeds/15-computer-security/2-us-cert-cyber-security-alerts HTTP/1.1
7,GET /index.php/component/content/article/10-tips/59-use-firefox-more-safe HTTP/1.1
8,GET /robots.txt HTTP/1.1
9,GET /onlinerenew/ HTTP/1.1

Answer 1

I would probably use regular expressions.

import pandas as pd
import re

def categoriser(x):
    # Dots are escaped so that '.' matches a literal dot, not any character.
    if re.search(r'\.(png|gif|jpeg|jpg)', x):
        return 'A'
    elif re.search(r'\.(pdf|ppt|doc)', x):
        return 'B'
    elif 'favicon.ico' in x:
        return 'C'
    elif 'robots.txt' in x:
        return 'D'
    elif 'GET / HTTP/1.1' in x:
        return 'E'
    elif re.search(r'\.(html|htm|asp|jsp|php|cgi|js|css)', x):
        return 'F'
    # No leading dot here, matching the 'compressed' list in the question.
    elif re.search(r'(zip|rar|gzip|tar|gz|7z)', x):
        return 'G'
    else:
        return 'H'


string = """0,GET /index.php?lang=ta HTTP/1.1
1,GET /index.php?limitstart=25&lang=en HTTP/1.1
2,GET /index.php/ta/component/content/article/43 HTTP/1.1
3,GET /index.php/ta/component/content/article/39-test HTTP/1.1
4,GET /robots.txt HTTP/1.1
5,GET /robots.txt HTTP/1.1
6,GET /index.php/en/computer-security-feeds/15-computer-security/2-us-cert-cyber-security-alerts HTTP/1.1
7,GET /index.php/component/content/article/10-tips/59-use-firefox-more-safe HTTP/1.1
8,GET /robots.txt HTTP/1.1
9,GET /onlinerenew/ HTTP/1.1"""    

frame = pd.DataFrame([x.split(",") for x in string.split("\n")])

print(frame.loc[:, 1].apply(categoriser))

The result is:

0    F
1    F
2    F
3    F
4    D
5    D
6    F
7    F
8    D
9    H
Name: 1, dtype: object

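Is this what you wanted? It would be good if you could include the desired output next time :) As for dtype: object, the NumPy arrays underlying a DataFrame call strings, and a bunch of other things, objects... in this case it is still a string :)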
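
One caveat with the checks above: they match anywhere in the request line, so a path such as /download/targets would be tagged G just because it contains "tar". Below is a rough sketch of a stricter variant that only looks at the extension of each path segment, using the standard library. The names classify_request and the *_EXTS sets are made up for illustration, and the dots on the archive extensions are an assumption on my part (the question's list has none). It also assumes each value is the bare request line, e.g. "GET /robots.txt HTTP/1.1".

```python
import posixpath
from urllib.parse import urlsplit

# Extension sets taken from the question. The leading dots on the archive
# extensions are an assumption; the question's list has none.
IMAGE_EXTS = {'.png', '.gif', '.jpeg', '.jpg'}
DOC_EXTS = {'.pdf', '.ppt', '.doc'}
WEB_EXTS = {'.html', '.htm', '.asp', '.jsp', '.php', '.cgi', '.js', '.css'}
ARCHIVE_EXTS = {'.zip', '.rar', '.gzip', '.tar', '.gz', '.7z'}


def classify_request(request):
    """Return the question's category letter for one request line.

    Hypothetical helper: expects text like 'GET /index.php?lang=ta HTTP/1.1'.
    """
    parts = request.split()
    if len(parts) < 2:
        return 'H'  # malformed line, treated as "other"

    # Keep only the path, dropping any ?query string.
    path = urlsplit(parts[1]).path
    segments = [s.lower() for s in path.split('/') if s]

    # Extension of every segment, so /index.php/ta/... still counts as .php.
    exts = {posixpath.splitext(s)[1] for s in segments}

    # Same priority order as the original if/elif chain.
    if exts & IMAGE_EXTS:
        return 'A'
    if exts & DOC_EXTS:
        return 'B'
    if 'favicon.ico' in segments:
        return 'C'
    if 'robots.txt' in segments:
        return 'D'
    if not segments:  # the path is just "/"
        return 'E'
    if exts & WEB_EXTS:
        return 'F'
    if exts & ARCHIVE_EXTS:
        return 'G'
    return 'H'
```

With a DataFrame like the one in the question, this could be applied the same way, e.g. df2['result'] = df2.Request.apply(classify_request). On the ten sample lines it should give the same letters as the regex version; the difference only shows up on paths that merely contain an extension-like substring.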
