I have a table like this:
protocol packets bytes bytes/pkt
------------------------------------------------------------------------
total 78913220 (100.00%) 47623614577 (100.00%) 603.49
ip 76930821 ( 97.49%) 45706321977 ( 95.97%) 594.12
tcp 45432316 ( 57.57%) 38990240707 ( 81.87%) 858.20
You can find real examples on the WIDE MAWI Working Group site.
I fetch the data with some simple Python code, and then I want to store each entry in some structure such as a dict.
For example:
This is not working code! It is only pseudocode for what I need.
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
import pandas as pd

http = httplib2.Http()
status, response = http.request('http://mawi.wide.ad.jp/mawi/ditl/ditl2017/201704131545.html')
for item in BeautifulSoup(response, parseOnlyThese=SoupStrainer('pre')):
    res = item.text
    pd.read_somefunction_to_read_string(res)
    if pd['protocol']['ip'] > .09 * pd['protocol']['total']:
        do_something
Expected output:
[
{'protocol' : 'total', 'packet' : 78913220, 'bytes' : 47623614577},
{'protocol' : 'ip', 'packet' : 76930821, 'bytes' : 45706321977}
]
Answer 0 (score: 1)
I tried fetching the data and parsing it into a list of dicts:
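import re
import requests
from bs4 import BeautifulSoup, SoupStrainer

r = requests.get('http://mawi.wide.ad.jp/mawi/ditl/ditl2017/201704120145.html')
pre = BeautifulSoup(r.content, "html.parser", parse_only=SoupStrainer('pre'))
entries = pre.text.split("\n")
# The first line holds the column names; split() copes with tabs or spaces.
keys = entries.pop(0).split()
# Skip the dashed separator line.
entries.pop(0)
rows = []
for entry in entries:
    # Drop the parenthesised percentages so only the numeric columns remain;
    # otherwise row[2] would be a percentage rather than the byte count.
    row = re.sub(r'\([^)]*\)', '', entry).split()
    # Keep only lines that look like data rows.
    if len(row) >= 3 and row[1].isdigit() and row[2].isdigit():
        result = {}
        result[keys[0]] = row[0]
        result[keys[1]] = int(row[1])
        result[keys[2]] = int(row[2])
        rows.append(result)
print(rows)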
(I haven't used pandas, so I'll leave the rest to you.)
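The comparison from the question does not strictly need pandas. As a minimal sketch, assuming the parsed rows use the column names from the table header (`protocol`, `packets`, `bytes`) and the counts are already integers, the list can be indexed by protocol with a plain dict. The sample rows are copied from the table in the question, and `by_protocol` and `packet_share` are hypothetical names chosen for illustration:

```python
# Sample rows copied from the table in the question; a stand-in for the
# scraped list of dicts, with the counts already converted to int.
rows = [
    {'protocol': 'total', 'packets': 78913220, 'bytes': 47623614577},
    {'protocol': 'ip', 'packets': 76930821, 'bytes': 45706321977},
    {'protocol': 'tcp', 'packets': 45432316, 'bytes': 38990240707},
]

# Index the rows by protocol name so each one can be looked up directly.
by_protocol = {row['protocol']: row for row in rows}


def packet_share(name):
    """Return the fraction of all packets that belong to protocol `name`."""
    return by_protocol[name]['packets'] / by_protocol['total']['packets']


# The condition from the question: does ip exceed 9% of the total packets?
if by_protocol['ip']['packets'] > 0.09 * by_protocol['total']['packets']:
    print('ip share: {:.2%}'.format(packet_share('ip')))
```

Keying by protocol assumes each protocol name appears only once on the page; if names repeat, later rows would overwrite earlier ones.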
Answer 1 (score: 1)
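First, split the response into lines on the newline character. Then, for each line, the protocol, packet, and bytes fields can be extracted with regular expressions and appended as a dict to a list (lst_dict). Finally, convert lst_dict to a pandas DataFrame.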
import re

import httplib2
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer

lst_dict = []
http = httplib2.Http()
status, response = http.request('http://mawi.wide.ad.jp/mawi/ditl/ditl2017/201704131545.html')
res = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('pre'))
items = res.text.split("\n")
# Skip the header and the dashed separator line.
for item in items[2:]:
    item = item.strip()
    # Capture protocol, packets and bytes; the percentage after packets is skipped.
    match = re.match(r'(\S+)\s+(\d+)\s+\(.*?\)\s+(\d+)\s', item)
    if match is None:
        continue  # blank or non-data line
    protocol, packet, byts = match.groups()
    # Named `row` rather than `dict` to avoid shadowing the built-in type.
    row = {'protocol': protocol, 'packet': int(packet), 'bytes': int(byts)}
    lst_dict.append(row)
df = pd.DataFrame(lst_dict)
print(df)
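As an alternative closer to the `pd.read_...` call the question was looking for, the text can be cleaned up and handed to `pandas.read_csv` directly. This is only a sketch under the assumption that the page's `<pre>` text has the same layout as the sample table in the question (copied below as `text`); the `cleaned` variable and the clean-up rules are illustrative and not taken from the page itself:

```python
import io
import re

import pandas as pd

# Sample text copied from the table in the question; on the real page this
# would be the text of the <pre> element.
text = """protocol packets bytes bytes/pkt
------------------------------------------------------------------------
total 78913220 (100.00%) 47623614577 (100.00%) 603.49
ip 76930821 ( 97.49%) 45706321977 ( 95.97%) 594.12
tcp 45432316 ( 57.57%) 38990240707 ( 81.87%) 858.20
"""

# Remove the parenthesised percentages and the dashed separator line so that
# every remaining line has exactly four whitespace-separated fields.
cleaned = re.sub(r'\([^)]*\)', '', text)
cleaned = '\n'.join(line for line in cleaned.splitlines()
                    if line.strip() and not line.startswith('-'))

# Parse the cleaned text as whitespace-delimited data, indexed by protocol.
df = pd.read_csv(io.StringIO(cleaned), sep=r'\s+', index_col='protocol')

# The condition from the question, expressed with label-based lookups.
if df.loc['ip', 'packets'] > 0.09 * df.loc['total', 'packets']:
    print(df.loc[['total', 'ip'], ['packets', 'bytes']])
```

Indexing by protocol makes the question's comparison a pair of `df.loc` lookups, and `df.reset_index().to_dict('records')` would give back a list of dicts if that shape is preferred.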