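This may be a simple question, but I have wasted a lot of time without figuring out what is going on here. I want to classify the HTTP requests in a web log file based on the extension of the requested resource. Here is my attempt.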
imgstr = ['.png','.gif','.jpeg','.jpg']
docstr = [ '.pdf','.ppt','.doc' ]
webstr = ['.html','.htm', '.asp', '.jsp', '.php', '.cgi', '.js','.css']
compressed = ['zip', 'rar', 'gzip', 'tar', 'gz', '7z']
def rtype(b):
    if any(x in b for x in imgstr):
        return 'A'
    elif any(x in b for x in docstr):
        return 'B'
    elif 'favicon.ico' in b:
        return 'C'
    elif 'robots.txt' in b:
        return 'D'
    elif 'GET / HTTP/1.1' in b:
        return 'E'
    elif any(x in b for x in webstr):
        return 'F'
    elif any(x in b for x in compressed):
        return 'G'
    else:
        return 'H'
df2['result'] = df2.Request.apply(rtype)
But df2['result'] contains only 'A'. Why? The dtype of df2.Request is object. I tried to change it with df2['Referer'] = df2['Referer'].astype(str), but the dtype is still object.
Here are the first 10 values of df2.Request:
0,GET /index.php?lang=ta HTTP/1.1
1,GET /index.php?limitstart=25&lang=en HTTP/1.1
2,GET /index.php/ta/component/content/article/43 HTTP/1.1
3,GET /index.php/ta/component/content/article/39-test HTTP/1.1
4,GET /robots.txt HTTP/1.1
5,GET /robots.txt HTTP/1.1
6,GET /index.php/en/computer-security-feeds/15-computer-security/2-us-cert-cyber-security-alerts HTTP/1.1
7,GET /index.php/component/content/article/10-tips/59-use-firefox-more-safe HTTP/1.1
8,GET /robots.txt HTTP/1.1
9,GET /onlinerenew/ HTTP/1.1
Answer 0 (score: 0)
I would probably use regular expressions.
import pandas as pd
import re
def categoriser(x):
    # The dots are escaped so that they match a literal '.', not any character.
    if re.search(r'\.(png|gif|jpeg|jpg)', x):
        return 'A'
    elif re.search(r'\.(pdf|ppt|doc)', x):
        return 'B'
    elif 'favicon.ico' in x:
        return 'C'
    elif 'robots.txt' in x:
        return 'D'
    elif 'GET / HTTP/1.1' in x:
        return 'E'
    elif re.search(r'\.(html|htm|asp|jsp|php|cgi|js|css)', x):
        return 'F'
    # As in the question, the archive names are matched without a leading dot.
    elif re.search(r'(zip|rar|gzip|tar|gz|7z)', x):
        return 'G'
    else:
        return 'H'
string = """0,GET /index.php?lang=ta HTTP/1.1
1,GET /index.php?limitstart=25&lang=en HTTP/1.1
2,GET /index.php/ta/component/content/article/43 HTTP/1.1
3,GET /index.php/ta/component/content/article/39-test HTTP/1.1
4,GET /robots.txt HTTP/1.1
5,GET /robots.txt HTTP/1.1
6,GET /index.php/en/computer-security-feeds/15-computer-security/2-us-cert-cyber-security-alerts HTTP/1.1
7,GET /index.php/component/content/article/10-tips/59-use-firefox-more-safe HTTP/1.1
8,GET /robots.txt HTTP/1.1
9,GET /onlinerenew/ HTTP/1.1"""
frame = pd.DataFrame([x.split(",") for x in string.split("\n")])
print(frame.loc[:, 1].apply(categoriser))
The result is:
0 F
1 F
2 F
3 F
4 D
5 D
6 F
7 F
8 D
9 H
Name: 1, dtype: object
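Is this what you wanted? It would be nice if you could include the desired output next time :) As for dtype: object, the NumPy arrays underlying a DataFrame label strings and a bunch of other things as object... so in this case each value is still a string :)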
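Both the substring test and the regular expressions above look at the whole request line, so a pattern such as tar would also match an unrelated path like /start/. A minimal sketch of a stricter variant follows, assuming that the second whitespace-separated field is the request target and that an extension on any path segment should count (so /index.php/ta/... is still 'F'). The name classify_request, the lower-casing, and the leading dots on the archive extensions are assumptions for illustration, not part of the original code; unlike the original 'E' test, this one also ignores the method and protocol version.

```python
import posixpath
from urllib.parse import urlsplit

# Extension groups taken from the question. The leading dots on the
# archive extensions are an assumption (the original list omits them).
IMAGE_EXTS = {'.png', '.gif', '.jpeg', '.jpg'}
DOC_EXTS = {'.pdf', '.ppt', '.doc'}
WEB_EXTS = {'.html', '.htm', '.asp', '.jsp', '.php', '.cgi', '.js', '.css'}
ARCHIVE_EXTS = {'.zip', '.rar', '.gzip', '.tar', '.gz', '.7z'}


def classify_request(request):
    """Classify a request line such as 'GET /index.php?lang=ta HTTP/1.1'."""
    fields = request.split()
    if len(fields) < 2:
        return 'H'  # malformed line: use the catch-all class
    target = fields[1]
    # Drop the query string and compare case-insensitively (an assumption).
    segments = urlsplit(target).path.lower().split('/')
    # Collect the extension of every path segment, e.g. {'', '.php'}.
    exts = {posixpath.splitext(seg)[1] for seg in segments}

    # Same precedence as the original if/elif chain.
    if exts & IMAGE_EXTS:
        return 'A'
    if exts & DOC_EXTS:
        return 'B'
    if 'favicon.ico' in segments:
        return 'C'
    if 'robots.txt' in segments:
        return 'D'
    if target == '/':
        return 'E'  # site root, regardless of method or HTTP version
    if exts & WEB_EXTS:
        return 'F'
    if exts & ARCHIVE_EXTS:
        return 'G'
    return 'H'


sample = [
    'GET /index.php?lang=ta HTTP/1.1',
    'GET /robots.txt HTTP/1.1',
    'GET /onlinerenew/ HTTP/1.1',
]
print([classify_request(line) for line in sample])
```

With a DataFrame like the one in the question, the same function could be applied as df2['result'] = df2.Request.apply(classify_request).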