I have a JSON file that I load into Python.
The JSON contains an agentId field and an agentText field.
Sample JSON:
{
"messages":
[
{"agentId": "1", "agentText": "I Love Python"},
{"agentId": "2", "agentText": "but cant seem to get my head around it"},
{"agentId": "3", "agentText": "what are the alternatives?"}
]
}
I am trying to build a dictionary (key/value pairs) from the agentId and agentText fields.
When I do the following, the key/value pairs come out fine:
import json
with open('20190626-101200-text-messages2.json', 'r') as f:
    data = json.load(f)
for message in data['messages']:
    agentIdandText = {message['agentId']: [message['agentText']]}
    print(agentIdandText)
And this is the output I get:
{'1': ['I Love Python']}
{'2': ['but cant seem to get my head around it']}
{'3': ['what are the alternatives?']}
But when I try to tokenize the words (as below), I start getting errors:
from nltk.tokenize import TweetTokenizer
varToken = TweetTokenizer()
import json
with open('20190626-101200-text-messages2.json', 'r') as f:
    data = json.load(f)
for message in data['messages']:
    agentIdandText = {message['agentId']: varToken.tokenize([message['agentText']])}
    print(agentIdandText)
Part of the error message (edited in from the comments):
return ENT_RE.sub(_convert_entity, _str_to_unicode(text, encoding))
TypeError: expected string or bytes-like object
What I expect is:
{
'1': ['I', 'love', 'python'],
'2': ['but', 'cant', 'seem', 'to', 'get', 'my', 'head', 'around', 'it'],
'3': ['what', 'are', 'the', 'alternatives?']
}
How can I achieve this?
Answer 0 (score: 3)
Does this change solve your problem? I think you have to pass a string, not a list, to the tokenize function.
import json
from nltk.tokenize import TweetTokenizer
varToken = TweetTokenizer()
with open('20190626-101200-text-messages2.json', 'r') as f:
    data = json.load(f)
# Collect every agentId -> token list pair in a single dictionary.
output_data = {}
for message in data['messages']:
    # Pass the string itself to tokenize(), not a one-element list.
    agentIdandText = {message['agentId']: varToken.tokenize(message['agentText'])}
    #print(agentIdandText)
    output_data.update(agentIdandText)
print(output_data)
Edit: added the output_data variable to collect all the keys in a single dictionary.
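The TypeError in the question looks like a regular-expression call inside the tokenizer receiving a list where it expects a string. The sketch below reproduces the same kind of failure using only the standard re module. The pattern and variable names are illustrative assumptions, not NLTK internals.

```python
import re

# Hypothetical stand-in for the regex work a tokenizer does internally.
word_pattern = re.compile(r"\w+")

agent_text = "I Love Python"

# Passing the string itself works.
tokens = word_pattern.findall(agent_text)
print(tokens)

# Wrapping the string in a list fails, because the regex engine
# only accepts str or bytes-like input.
try:
    word_pattern.findall([agent_text])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

Dropping the square brackets around message['agentText'] removes the list wrapper, which is the change made in the code above.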
Answer 1 (score: 0)
You can use str.split() in a dictionary comprehension:
agentIdandText = {d['agentId']: d['agentText'].split() for d in data["messages"]}
Output:
{
'1': ['I', 'Love', 'Python'],
'2': ['but', 'cant', 'seem', 'to', 'get', 'my', 'head', 'around', 'it'],
'3': ['what', 'are', 'the', 'alternatives?']
}
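Note that str.split() only splits on whitespace, so the question mark stays attached to 'alternatives?'. If punctuation should become separate tokens, one possible approach is a small regular expression. This is a minimal sketch with a hypothetical pattern, and it is not meant to mimic what TweetTokenizer does.

```python
import re

data = {
    "messages": [
        {"agentId": "1", "agentText": "I Love Python"},
        {"agentId": "2", "agentText": "but cant seem to get my head around it"},
        {"agentId": "3", "agentText": "what are the alternatives?"},
    ]
}

# Hypothetical pattern: a run of word characters, or one single
# character that is neither a word character nor whitespace.
token_pattern = re.compile(r"\w+|[^\w\s]")

agent_tokens = {
    d["agentId"]: token_pattern.findall(d["agentText"])
    for d in data["messages"]
}
print(agent_tokens["3"])  # ['what', 'are', 'the', 'alternatives', '?']
```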
Answer 2 (score: 0)
temp = {
"messages":
[
{"agentId": "1", "agentText": "I Love Python"},
{"agentId": "2", "agentText": "but cant seem to get my head around it"},
{"agentId": "3", "agentText": "what are the alternatives?"}
]
}
result = [{e['agentId']: e['agentText'].split()} for e in temp['messages']]
for e in result:
    print(e)
# result
{'1': ['I', 'Love', 'Python']}
{'2': ['but', 'cant', 'seem', 'to', 'get', 'my', 'head', 'around', 'it']}
{'3': ['what', 'are', 'the', 'alternatives?']}
You should take a look at split.
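As a quick illustration of how str.split() behaves, here is a small sketch. The sample strings are made up for the example.

```python
text = "I  Love   Python"  # note the repeated spaces

# With no argument, split() treats any run of whitespace as one separator.
default_split = text.split()
print(default_split)  # ['I', 'Love', 'Python']

# With an explicit separator, every single space counts, so empty
# strings appear between consecutive spaces.
space_split = text.split(" ")
print(space_split)  # ['I', '', 'Love', '', '', 'Python']

# maxsplit limits the number of splits performed.
first_and_rest = "what are the alternatives?".split(maxsplit=1)
print(first_and_rest)  # ['what', 'are the alternatives?']
```

For this use case the no-argument form is usually the safer choice, since it does not produce empty tokens when a message contains extra spaces.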