I'm new to pandas; any help is appreciated.
import pandas as pd
def csv_reader(fileName):
    # read only the columns that are needed
    reqcols = ['_id__$oid', 'payload', 'channel']
    io = pd.read_csv(fileName, sep=",", usecols=reqcols)
    print(io['payload'].values)
    return io
One row of the io['payload'] output:
{
"destination_ip": "172.31.14.66",
"date": "2014-10-19T01:32:36.669861",
"classification": "Potentially Bad Traffic",
"proto": "UDP",
"source_ip": "172.31.0.2",
"priority": "`2",
"header": "1:2003195:5",
"signature": "ET POLICY Unusual number of DNS No Such Name Responses ",
"source_port": "53",
"destination_port": "34638",
"sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"
}
I am trying to extract specific data from the ndarray object. What method can I use to extract the following fields from the data frame?
"destination_ip": "172.31.13.124",
"proto": "ICMP",
"source_ip": "201.158.32.1",
"date": "2014-09-28T14:49:43.391463",
"sensor": "139cfdf2-471e-11e4-9ee4-0a0b6e7c3e9e"
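Answer 0 (score: 2)
I think you first need to convert the string representations of dictionaries in the payload column to dicts with json.loads or ast.literal_eval, then create a new DataFrame with the constructor, filter its columns by subset and, if necessary, add the original columns back with concat: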
d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']}
import pandas as pd
reqcols = ['_id__$oid', 'payload', 'channel']
df = pd.DataFrame(d)
print(df)
  _id__$oid      channel                                            payload
0     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...
1     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...
2     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...
import json
import ast
df.payload = df.payload.apply(json.loads)
# another, slower solution
# df.payload = df.payload.apply(ast.literal_eval)
required = ["destination_ip", "proto", "source_ip", "date", "sensor"]
df1 = pd.DataFrame(df.payload.values.tolist())[required]
print (df1)
  destination_ip proto   source_ip                        date  \
0   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861
1   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861
2   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861
                                 sensor
0  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
1  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
2  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
df2 = pd.concat([df[['_id__$oid','channel']], df1], axis=1)
print (df2)
  _id__$oid      channel destination_ip proto   source_ip  \
0     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2
1     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2
2     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2
                         date                                sensor
0  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
1  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
2  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
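The three steps above (parse, build a frame, concat) can be folded into one helper. This is only a minimal sketch and not part of the original answer: the name expand_payload, its parameters, and the two-row sample (including the id 542f9) are made up for illustration. It uses reindex instead of [required], so a key that is missing from every payload becomes a NaN column rather than raising KeyError.

```python
import json

import pandas as pd


def expand_payload(df, required, payload_col="payload"):
    """Parse a column of JSON strings and append the requested keys as columns.

    expand_payload is a hypothetical helper name, not a pandas function.
    """
    # json.loads turns each JSON string into a dict
    parsed = df[payload_col].apply(json.loads)
    # one row per dict; reuse the original index so concat aligns row by row
    expanded = pd.DataFrame(parsed.tolist(), index=df.index)
    # reindex keeps only the requested keys and fills absent ones with NaN
    expanded = expanded.reindex(columns=required)
    return pd.concat([df.drop(columns=payload_col), expanded], axis=1)


# made-up sample rows, shortened from the question's payload
sample = pd.DataFrame({
    "_id__$oid": ["542f8", "542f9"],
    "channel": ["snort_alert", "snort_alert"],
    "payload": [
        '{"destination_ip": "172.31.14.66", "proto": "UDP", "source_port": "53"}',
        '{"destination_ip": "172.31.13.124", "proto": "ICMP"}',
    ],
})
result = expand_payload(sample, ["destination_ip", "proto", "sensor"])
print(result)
```

Passing index=df.index matters when the input frame has been filtered or re-sorted, because concat with axis=1 aligns on index labels rather than on row position.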
Timings:
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)
In [38]: %timeit pd.DataFrame(df.payload.apply(json.loads).values.tolist())[required]
1 loop, best of 3: 379 ms per loop
In [39]: %timeit pd.read_json('[{}]'.format(df.payload.str.cat(sep=',')))[required]
1 loop, best of 3: 528 ms per loop
In [40]: %timeit pd.DataFrame(df.payload.apply(ast.literal_eval).values.tolist())[required]
1 loop, best of 3: 1.98 s per loop
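%timeit is only available inside IPython. As a rough sketch, the same comparison can be run in a plain script with the standard-library timeit module. The two-key payload and the 3000-row frame below are made up, and absolute numbers depend on the machine and library versions, so none are shown here.

```python
import ast
import json
import timeit

import pandas as pd

# made-up payload, repeated to get a frame large enough to time
row = '{"destination_ip": "172.31.14.66", "proto": "UDP"}'
df = pd.DataFrame({"payload": [row] * 3000})
required = ["destination_ip", "proto"]


def with_json_loads():
    return pd.DataFrame(df.payload.apply(json.loads).tolist())[required]


def with_literal_eval():
    return pd.DataFrame(df.payload.apply(ast.literal_eval).tolist())[required]


# both parsers must produce the same frame before their speed is compared
same = with_json_loads().equals(with_literal_eval())

for func in (with_json_loads, with_literal_eval):
    seconds = timeit.timeit(func, number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```

Note that ast.literal_eval only works here because the payload contains nothing but strings; JSON literals such as true, false or null are not valid Python literals, so json.loads is the safer choice for real JSON.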
Answer 1 (score: 1)
Using @jezrael's sample df
d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']}
df = pd.DataFrame(d)
Solution
Join the payload strings with str.cat and parse the result with pd.read_json
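from io import StringIO
cols = 'destination_ip proto source_ip date sensor'.split()
# join the JSON strings into one JSON array and parse it in a single call;
# StringIO is used because recent pandas versions deprecate literal JSON strings
parsed = pd.read_json(StringIO('[{}]'.format(df.payload.str.cat(sep=','))))[cols]
# the positional axis argument of drop() was removed in pandas 2.0
df.drop(columns='payload').join(parsed)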
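A related variant, offered as a sketch rather than as this answer's method: joining the payloads with newlines instead of commas produces JSON Lines text, which pd.read_json can parse with lines=True. The sample rows here are made up, and StringIO wraps the text because recent pandas versions deprecate passing a literal JSON string.

```python
from io import StringIO

import pandas as pd

# made-up sample rows, shortened from the question's payload
df = pd.DataFrame({
    "channel": ["snort_alert", "snort_alert"],
    "payload": [
        '{"source_ip": "172.31.0.2", "proto": "UDP", "source_port": "53"}',
        '{"source_ip": "201.158.32.1", "proto": "ICMP"}',
    ],
})
cols = ["source_ip", "proto"]

# one JSON object per line is the JSON Lines layout that lines=True expects
text = df.payload.str.cat(sep="\n")
parsed = pd.read_json(StringIO(text), lines=True)[cols]

# join aligns on the index, so this relies on df having a default RangeIndex
out = df.drop(columns="payload").join(parsed)
print(out)
```

Both this variant and the bracketed-array version number the parsed rows from 0, so if df has a non-default index, calling reset_index(drop=True) on it before the join keeps the rows aligned.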
Answer 2 (score: 0)
Accessing columns in pandas is straightforward. Just pass a list of the columns you need:
Code:
columns = ["destination_ip", "proto", "source_ip", "date", "sensor"]
extracted_data = df[columns]
Test code:
data = {
"destination_ip": "172.31.14.66",
"date": "2014-10-19T01:32:36.669861",
"classification": "Potentially Bad Traffic",
"proto": "UDP",
"source_ip": "172.31.0.2",
"priority": "`2",
"header": "1:2003195:5",
"signature": "ET POLICY Unusual number of DNS No Such Name Responses ",
"source_port": "53",
"destination_port": "34638",
"sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"
}
df = pd.DataFrame([data, data])
columns = ["destination_ip", "proto", "source_ip", "date", "sensor"]
print(df[columns])
Results:
  destination_ip proto   source_ip                        date  \
0   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861
1   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861
                                 sensor
0  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
1  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
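Note that this selection only works once the JSON keys are real columns. In the question's frame, payload is a single column of strings, so the same list selection raises KeyError until those strings are parsed. A small sketch with a made-up one-row frame illustrates the difference:

```python
import json

import pandas as pd

columns = ["destination_ip", "proto"]
# made-up frame shaped like the question's: JSON text inside one column
raw = pd.DataFrame({
    "channel": ["snort_alert"],
    "payload": ['{"destination_ip": "172.31.14.66", "proto": "UDP"}'],
})

try:
    raw[columns]
    selected_directly = True
except KeyError:
    # the keys live inside the JSON strings, not in the frame's columns
    selected_directly = False

# after parsing, each key becomes a column and the list selection works
parsed = pd.DataFrame(raw["payload"].apply(json.loads).tolist())
extracted = parsed[columns]
print(selected_directly)
print(extracted)
```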
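Answer 3 (score: 0)
The problem is that payload is a column of the CSV input data whose values are JSON strings. So you can first use read_csv() to parse the whole file, but you then need to parse each JSON object inside it. Let's use this sample data: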
payload = pd.Series(['{"a":1, "b":2}', '{"b":4, "c":5}'])
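Now build a single JSON string:
json_str = ','.join(payload).join('[]')
which gives:
'[{"a":1, "b":2},{"b":4, "c":5}]'
Then parse it:
from io import StringIO
# StringIO is used because recent pandas versions deprecate literal JSON strings
pd.read_json(StringIO(json_str))
to get: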
     a  b    c
0  1.0  2  NaN
1  NaN  4  5.0
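Putting the two stages together, here is a sketch that starts from CSV text instead of a file on disk. The CSV content is made up for illustration; inside a quoted CSV field the quotes of the JSON string are escaped by doubling them, which read_csv undoes when it parses the field.

```python
from io import StringIO

import pandas as pd

# made-up CSV text; quotes inside the JSON field are doubled, as CSV requires
csv_text = (
    "channel,payload\n"
    'snort_alert,"{""a"":1, ""b"":2}"\n'
    'snort_alert,"{""b"":4, ""c"":5}"\n'
)

# stage 1: read_csv parses the file and leaves payload as plain strings
frame = pd.read_csv(StringIO(csv_text))

# stage 2: wrap the comma-joined payloads in brackets to form one JSON array
json_text = "[" + ",".join(frame["payload"]) + "]"
parsed = pd.read_json(StringIO(json_text))

# keys missing from a row come back as NaN
print(parsed)
```

Building one JSON array means a single malformed payload makes the whole parse fail, so for untrusted input it may be easier to locate the bad row by applying json.loads per row instead.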