I am trying to run this code.
import urllib
import json
import requests

url = 'http://www.webiron.com/abuse_feed//?format=json'
response = urllib.urlopen(url)
data_json = json.loads(response.read())
for i in data_json:
    i['LogEvent'] = 'Trial'
    i['EvtLen'] = 213
print json.dumps(data_json,indent=6)
The output I get is as follows (it is quite large, so only part of it is shown):
[
      {
            "incidents_reported": 3,
            "attacker_ip": "178.137.88.8",
            "event_time": "2018-05-15 19:30:09.832568-07",
            "event_emails": [
                  "hostmaster@kyivstar.net",
                  "abuse@kyivstar.net",
                  "noc@kyivstar.net"
            ],
            "entry_type": "report",
            "EvtLen": 213,
            "emails_deliverable": "Yes",
            "LogEvent": "Trial",
            "event_msg": "Fake Referrer Log SPAM Bot",
            "days_unresolved": "<font color=\"green\"><3</font>"
      },
      {
            "incidents_reported": 52,
            "attacker_ip": "221.229.166.171",
            "event_time": "2018-05-15 19:29:45.039281-07",
            "event_emails": [
                  "anti-spam@ns.chinanet.cn.net"
            ],
            "entry_type": "report",
            "EvtLen": 213,
            "emails_deliverable": "No",
            "LogEvent": "Trial",
            "event_msg": "Abusive network connectivity",
            "days_unresolved": "<font color=\"red\">3</font>"
      }
]
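Now look at the element days_unresolved : "<font color=\"red\">3</font>"
or days_unresolved: "<font color=\"green\"><3</font>"
Is it possible to update or modify such elements to simply
days_unresolved : 3
while keeping the rest of the data? That is the output I want. Is there a way to find such tags and remove them, or to iterate over the whole data set and update it? Is there any solution for this?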
Answer 0 (score: 0)
Well, Beautiful Soup (bs4) can be used to strip out the HTML tags. Otherwise, use a regular expression to capture the content between > and <.
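A minimal sketch of the regex route, assuming the values only ever contain simple tags such as <font ...>. The helper name strip_tags_regex and the sample strings are illustrative and not part of the original answer. The pattern requires a letter right after < (or </), so a literal < like the one in "<3" is not mistaken for a tag.

```python
import re

# Matches an opening or closing tag whose name starts with a letter,
# e.g. <font color="red"> or </font>. A bare "<" followed by a digit
# (as in "<3") is deliberately not treated as a tag.
TAG_RE = re.compile(r'</?[A-Za-z][^>]*>')

def strip_tags_regex(value):
    """Hypothetical helper: remove simple HTML tags from a string."""
    return TAG_RE.sub('', value)

print(strip_tags_regex('<font color="red">3</font>'))     # 3
print(strip_tags_regex('<font color="green"><3</font>'))  # <3
```

A regex like this is fine for small, predictable fragments; for arbitrary HTML a real parser is the safer choice.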
Answer 1 (score: 0)
You can use HTMLParser from the HTMLParser module (Python 2) or from html.parser (Python 3). It is in the standard library, so you do not need to install any additional packages:
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MLStripper(HTMLParser):
    strict = False
    convert_charrefs = True

    def __init__(self):
        self.reset()
        self.fed = []  # collected text fragments

    def handle_data(self, data):
        # called for the text between tags
        self.fed.append(data)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
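To show how a stripper like this could be applied to the whole data set, here is a self-contained sketch for Python 3 only. The class name TextCollector, the helper text_only, and the records list are assumptions made for illustration; the records merely imitate the shape of the feed in the question.

```python
from html.parser import HTMLParser  # Python 3 standard library

class TextCollector(HTMLParser):
    """Collects only the text nodes of the markup it is fed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def text_only(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return ''.join(parser.parts)

# Sample records shaped like the feed in the question (assumed data).
records = [
    {"attacker_ip": "221.229.166.171",
     "days_unresolved": "<font color=\"red\">3</font>"},
    {"attacker_ip": "178.137.88.8",
     "days_unresolved": "<font color=\"green\"><3</font>"},
]

# Overwrite each value in place with its tag-free text.
for record in records:
    record["days_unresolved"] = text_only(record["days_unresolved"])

print([r["days_unresolved"] for r in records])
```

The values stay strings after this step; converting them to integers would be a separate decision, since a value such as "<3" is not a number.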
Answer 2 (score: 0)
Use a regex to update days_unresolved
Demo:
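import re

# data is the list loaded from the JSON feed (data_json in the question)
for i in data:
    if i.get("days_unresolved"):
        # capture the text up to the closing tag, so a literal "<" as in "<3" is kept
        m = re.search('>(.*?)</', i["days_unresolved"])
        i["days_unresolved"] = m.group(1) if m else i["days_unresolved"]
print(data)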
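Since the question asks for days_unresolved : 3 rather than "3", the extracted text can also be converted to an integer when it is purely numeric. The following is a sketch under that assumption; clean_days and the sample data list are hypothetical names and values used only for illustration.

```python
import json
import re

def clean_days(value):
    """Hypothetical helper: pull the text out of the tag, as int if numeric."""
    if not isinstance(value, str):
        return value  # already cleaned, or not a string at all
    match = re.search(r'>(.*?)</', value)
    text = match.group(1) if match else value
    # Only purely decimal text becomes an int; "<3" stays a string.
    return int(text) if text.isdecimal() else text

# Sample records imitating the feed in the question (assumed data).
data = [
    {"event_msg": "Abusive network connectivity",
     "days_unresolved": "<font color=\"red\">3</font>"},
    {"event_msg": "Fake Referrer Log SPAM Bot",
     "days_unresolved": "<font color=\"green\"><3</font>"},
]

for item in data:
    if "days_unresolved" in item:
        item["days_unresolved"] = clean_days(item["days_unresolved"])

print(json.dumps(data, indent=2))
```

With this approach the first record ends up with the integer 3, while the second keeps the string "<3", because "less than three" has no single integer value.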
Answer 3 (score: 0)
label
print soup.a.string
>>> 'Martin Elias'
or
print soup.text