I have a JSON file that contains multiple objects, each with a text field:
{
"messages":
[
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:51:00", "agentId": "2001-100001", "skillId": "2001-20000", "agentText": "That customer was great"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:55:00", "agentId": "2001-100001", "skillId": "2001-20001", "agentText": "That customer was stupid\nI hope they don't phone back"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:57:00", "agentId": "2001-100001", "skillId": "2001-20002", "agentText": "Line number 3"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:59:00", "agentId": "2001-100001", "skillId": "2001-20003", "agentText": ""}
]
}
I am only interested in the "agentText" field.
I basically need to extract every word in the agentText field and count how many times each word occurs.
So, my Python code:
import json

with open('20190626-101200-text-messages.json') as f:
    data = json.load(f)

for message in data['messages']:
    # Trim the text and turn line breaks into spaces before splitting.
    splittext = message['agentText'].strip().replace('\n', ' ').replace('\r', ' ')
    if len(splittext) > 0:
        splittext2 = splittext.split(' ')
        print(splittext2)
gives me this:
['That', 'customer', 'was', 'great']
['That', 'customer', 'was', 'stupid', 'I', 'hope', 'they', "don't", 'phone', 'back']
['Line', 'number', '3']
How can I add each word to an array together with its count? Something like:
That 2
customer 2
was 2
great 1
..
and so on?
Answer 0 (score: 1)
data = '''{"messages":
[
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:51:00", "agentId": "2001-100001", "skillId": "2001-20000", "agentText": "That customer was great"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:55:00", "agentId": "2001-100001", "skillId": "2001-20001", "agentText": "That customer was stupid I hope they don't phone back"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:57:00", "agentId": "2001-100001", "skillId": "2001-20002", "agentText": "Line number 3"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:59:00", "agentId": "2001-100001", "skillId": "2001-20003", "agentText": ""}
]
}
'''
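import json
from collections import Counter
from pprint import pprint

def words(data):
    # Yield every whitespace-separated word from each message's agentText.
    for m in data['messages']:
        yield from m['agentText'].split()

# Counter tallies how often each yielded word appears.
c = Counter(words(json.loads(data)))
pprint(c.most_common())
Prints: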
[('That', 2),
('customer', 2),
('was', 2),
('great', 1),
('stupid', 1),
('I', 1),
('hope', 1),
('they', 1),
("don't", 1),
('phone', 1),
('back', 1),
('Line', 1),
('number', 1),
('3', 1)]
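To get the plain "word count" layout the question asks for, one option is to loop over `most_common()` and format each pair. This is only a minimal sketch: the `texts` list and the `format_counts` helper are hypothetical names introduced here for illustration, and the sample strings are copied from the question's JSON.

```python
from collections import Counter

# Hypothetical sample: the agentText values from the question's JSON.
texts = [
    "That customer was great",
    "That customer was stupid\nI hope they don't phone back",
    "Line number 3",
    "",
]

def format_counts(texts):
    """Return 'word count' lines, most frequent first."""
    # str.split() with no argument splits on any whitespace, including '\n',
    # and returns an empty list for an empty string.
    counts = Counter(word for text in texts for word in text.split())
    return [f"{word} {count}" for word, count in counts.most_common()]

for line in format_counts(texts):
    print(line)
```

Because `split()` without arguments already handles newlines and empty strings, the `strip()`/`replace()` calls and the length check from the question should not be needed here.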
Answer 1 (score: 1)
Check this out.
data = {
"messages":
[
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:51:00", "agentId": "2001-100001", "skillId": "2001-20000", "agentText": "That customer was great"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:55:00", "agentId": "2001-100001", "skillId": "2001-20001", "agentText": "That customer was stupid\nI hope they don't phone back"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:57:00", "agentId": "2001-100001", "skillId": "2001-20002", "agentText": "Line number 3"},
{"timestamp": "123456789", "timestampIso": "2019-06-26 09:59:00", "agentId": "2001-100001", "skillId": "2001-20003", "agentText": ""}
]
}
# Collect the word list of every non-empty message.
var = []
for row in data['messages']:
    new_row = row['agentText'].split()
    if new_row:
        var.append(new_row)

# Tally each word in a plain dict.
temp = dict()
for e in var:
    for j in e:
        if j in temp:
            temp[j] = temp[j] + 1
        else:
            temp[j] = 1

for key, value in temp.items():
    print(f'{key}: {value}')
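The if/else bookkeeping above could also be folded into `collections.defaultdict`, which supplies a starting count of 0 for unseen words. The following is a sketch under that assumption, not part of the original answer; `count_words` and the trimmed-down `messages` list are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical sample: only the agentText field of each message is kept.
messages = [
    {"agentText": "That customer was great"},
    {"agentText": "That customer was stupid\nI hope they don't phone back"},
    {"agentText": "Line number 3"},
    {"agentText": ""},
]

def count_words(messages):
    """Count whitespace-separated words across all agentText fields."""
    counts = defaultdict(int)  # missing keys start at 0
    for row in messages:
        for word in row["agentText"].split():
            counts[word] += 1
    return dict(counts)

for key, value in count_words(messages).items():
    print(f"{key}: {value}")
```

This skips the intermediate list of lists, since an empty `agentText` simply yields no words to count.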