I parsed XML data into a dict. The dictionary has the following form:
{'id': 'Q1',
'subject': 'Massage oil',
'question': 'Where I can buy good oil for massage?',
'comments': {},
'related': {'Q1_R1': {'rid': 'Q1_R1',
'rel_subject': 'massage oil',
'rel_question': 'is there any place i can find scented massage oils in qatar?',
'rel_givenRelevance': 'PerfectMatch',
'rel_givenRank': '1',
'rel_comments': {'Q1_R1_C1': {'cid': 'Q1_R1_C1',
'com_date': '2010-08-27 01:40:05',
'com_username': 'anonymous',
'comment': 'Yes. It is right behind Kahrama in the National area.',
'com_isTraining': True},
'Q1_R1_C2': {'cid': 'Q1_R1_C2',
'com_date': '2010-08-27 01:42:59',
'com_username': 'sognabodl',
'comment': 'whats the name of the shop?',
'com_isTraining': True},
'Q1_R1_C3': {'cid': 'Q1_R1_C3',
'com_date': '2010-08-27 01:44:09',
'com_username': 'anonymous',
'comment': "It's called Naseem Al-Nadir. Right next to the Smartlink shop. You'll find the chinese salesgirls at affordable prices there.",
'com_isTraining': True},
'Q1_R1_C4': {'cid': 'Q1_R1_C4',
'com_date': '2010-08-27 01:58:39',
'com_username': 'sognabodl',
'comment': 'dont want girls;want oil',
'com_isTraining': True},
'Q1_R1_C5': {'cid': 'Q1_R1_C5',
'com_date': '2010-08-27 01:59:55',
'com_username': 'anonymous',
'comment': "Try Both ;) I'am just trying to be helpful. On a serious note - Please go there. you'll find what you are looking for.",
'com_isTraining': True},
'Q1_R1_C6': {'cid': 'Q1_R1_C6',
'com_date': '2010-08-27 02:02:53',
'com_username': 'lawa',
'comment': 'you mean oil and filter both',
'com_isTraining': True},
'Q1_R1_C7': {'cid': 'Q1_R1_C7',
'com_date': '2010-08-27 02:04:29',
'com_username': 'anonymous',
'comment': "Yes Lawa...you couldn't be more right LOL",
'com_isTraining': True}},
'rel_featureVector': [],
'rel_isTraining': True}},
'featureVector': [],
'isTraining': True}
In general, the structure looks like this:
{ID : Q1,
...
related:{
Q1_R1 :{
rid:Q1_R1,
....
rel_comments:{
Q1_R1_C1: {
cid: Q1_R1_C1,
....
}
....
Q1_R1_C10
}
...
Q1_R10
}
...
ID : 100
}
I want to turn it into:
ID ... question rid ... rel_question cid .... comment
Q1 ... 1234 Q1_R1 ... 5678 Q1_R1_c1 .... 90
Q1 ... 1234 Q1_R1 ... 5678 Q1_R1_c2 .... 92
Q1 ... 1234 Q1_R1 ... 5678 Q1_R1_c3 .... 93
..........................................
Q100 ... 1234 Q100_R10 ... 5678 Q100_R10_c13 ....465
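I tried to flatten this dictionary, but I end up with the rid values (Q1_R1 ... Q100_R10) and the cid values (Q1_R1_c1 ... Q100_R10_c13) as columns instead of as row values. Is there any way to do this?
This is SemEval 2016 subtask 1 data, and I think using DataFrame functions such as apply could improve performance, for example when computing how similar the Q1 question and the Q1_R1_C1 comment are.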
Answer 0 (score: 0)
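You have to walk through the structure of the dictionary and build another dictionary with the right structure, so that pandas can make the desired DataFrame from it. The code here only covers some of the columns, but you should get the idea: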
import pandas as pd

# mydict is the parsed dictionary shown in the question.
# One list per output column; every list gets one entry per comment.
df_dict = {
    'id': [],
    'subject': [],
    'question': [],
    'rid': [],
    'rel_question': [],
    'cid': [],
    'comment': []
}
for rid in mydict['related']:
    for cid in mydict['related'][rid]['rel_comments']:
        # Question-level fields are repeated on every row.
        df_dict['id'].append(mydict['id'])
        df_dict['subject'].append(mydict['subject'])
        df_dict['question'].append(mydict['question'])
        # Related-question fields.
        df_dict['rid'].append(rid)
        df_dict['rel_question'].append(mydict['related'][rid]['rel_question'])
        # Comment-level fields.
        df_dict['cid'].append(cid)
        df_dict['comment'].append(mydict['related'][rid]['rel_comments'][cid]['comment'])
df = pd.DataFrame(df_dict)
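The same idea can also be written by collecting one dict per row, which extends easily to many questions. The following is only a sketch under some assumptions: `flatten_question`, `questions`, `sample`, and `jaccard` are hypothetical names that do not appear in the question, `sample` is a cut-down stand-in for the real parsed data, and the word-overlap score is just a placeholder for whatever similarity measure you actually need.

```python
import pandas as pd

def flatten_question(q):
    """Return one row (a dict) per comment of one parsed question dict."""
    rows = []
    for rid, rel in q.get('related', {}).items():
        for cid, com in rel.get('rel_comments', {}).items():
            rows.append({
                'id': q['id'],
                'subject': q['subject'],
                'question': q['question'],
                'rid': rid,
                'rel_question': rel['rel_question'],
                'cid': cid,
                'comment': com['comment'],
            })
    return rows

# Cut-down stand-in for the parsed data, following the structure in the question.
sample = {
    'id': 'Q1',
    'subject': 'Massage oil',
    'question': 'Where I can buy good oil for massage?',
    'related': {
        'Q1_R1': {
            'rel_question': 'is there any place i can find scented massage oils in qatar?',
            'rel_comments': {
                'Q1_R1_C1': {'comment': 'Yes. It is right behind Kahrama in the National area.'},
                'Q1_R1_C2': {'comment': 'whats the name of the shop?'},
            },
        },
    },
}

# Assumption: all parsed questions are available as a list of dicts.
questions = [sample]
df = pd.DataFrame([row for q in questions for row in flatten_question(q)])

def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a placeholder for a real measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

# Row-wise similarity between each question and each of its comments.
df['q_c_sim'] = df.apply(lambda r: jaccard(r['question'], r['comment']), axis=1)
```

With the rows in this long format, rid and cid are ordinary columns, so a row-wise `apply` like the one above can compare any pair of text columns. Note that a related question with no comments produces no rows in this sketch.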