Question

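I want to extract content from an XML file based on the value of the type field. It is basically a JSON file that I converted to XML. The file contains the fields 'body', 'id', 'type' and snippets. If type = 'summary', I want to extract the content of all of these fields. The code I wrote is: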

from bs4 import BeautifulSoup

def load_extract(data):
    path = ""  # path to the XML file (left empty in the post)
    with open(path) as f:
        soup = BeautifulSoup(f, "html.parser")
    q1 = []
    qtype = []
    snippets = []
    # Collect every <body> and <type> into flat lists, matched only by position
    for q in soup.find_all('body'):
        q1.append(q.text)
    for types in soup.find_all('type'):
        qtype.append(types.text)
    snippets = soup.find_all('snippets')
    summary_ids = []
    summary_dict = []
    for i in range(len(qtype)):
        print("extracting the summary type question")
        if qtype[i] == 'summary':
            summary_ids.append(i)
    # Pair body and snippets by index; this assumes all lists have equal length
    for j in summary_ids:
        summary_dict.append({q1[j]: snippets[j]})
    return summary_dict

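The code works fine on the small set I ran it on, but for the large set len(q1) is not equal to len(snippets), which creates a problem. I don't know whether the training data actually lacks snippets for some bodies, but this breaks the mapping and extraction. I am wondering whether it is possible to extract the body, content and snippets only where type = 'summary'. Requesting your help!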
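
One possible direction, as a minimal sketch only: instead of building three separate lists and pairing them by index, walk the file one record at a time and read 'type', 'body', 'id' and snippets from the same parent element, so a record without snippets cannot shift the others. This sketch uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and it assumes each record is wrapped in its own element. The wrapper name `question`, the root `questions`, the function name `extract_by_type` and the sample data are all hypothetical, since the real file layout is not shown here.

```python
import xml.etree.ElementTree as ET


def extract_by_type(xml_text, wanted_type="summary", record_tag="question"):
    """Return one dict per record whose <type> equals wanted_type.

    record_tag is an assumed wrapper element name; adjust it to the real file.
    """
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter(record_tag):
        # Read the type from this record only, never from a global list
        if (record.findtext("type") or "").strip() != wanted_type:
            continue
        results.append({
            "id": record.findtext("id"),
            "body": record.findtext("body"),
            # A record with no <snippets> child simply yields an empty list
            "snippets": ["".join(s.itertext()).strip()
                         for s in record.findall("snippets")],
        })
    return results


# Made-up sample data; the third record deliberately has no snippets
sample = """<questions>
  <question><id>1</id><type>summary</type><body>What is X?</body><snippets>X is a thing.</snippets></question>
  <question><id>2</id><type>factoid</type><body>Where is Y?</body><snippets>Y is here.</snippets></question>
  <question><id>3</id><type>summary</type><body>Why Z?</body></question>
</questions>"""

summaries = extract_by_type(sample)
for item in summaries:
    print(item["id"], item["body"], item["snippets"])
```

Because every field is looked up relative to its own record, the lengths of the per-field lists no longer need to match. If the converted XML does not have a per-record wrapper element, this approach would need adapting.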

How to do this using Python

0 Answers: