XML text to a matrix of tags

Time: 2017-03-01 02:27:43

Tags: python xml nlp

I want to create a matrix or list of tags from an XML file.

For example, given the XML:

<?xml version="1.0" encoding="UTF-8"?>
Words about John.  
 <person Key="NameHash:001">John Master-Smith</person> is a coder.
<person Key="NameHash:002">Aleksandra</person> likes goldfish.
She likes to dance.

I can retrieve the text with Python:

from bs4 import BeautifulSoup

with open('test.txt.xml') as f:
    soup = BeautifulSoup(f, 'html.parser')
    text = soup.get_text()

This returns:

Words about John.
 John Master-Smith is a coder.
Aleksandra likes goldfish.
She likes to dance.

I then split it into tokens:

[['Words', 'about', 'John'],
 ['John', 'Master-Smith', 'is', 'a', 'coder'],
 ['Aleksandra', 'likes', 'goldfish'],
 ['She', 'likes', 'to', 'dance']]
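
The question does not say which tokenizer was used. As a minimal sketch, assuming one sentence per line and word tokens that may contain hyphens, a regex-based `tokenize` (a hypothetical stand-in, not the asker's actual tokenizer) reproduces the token lists shown above:

```python
import re

def tokenize(text):
    """Split text into one token list per non-empty line.

    Assumption: each line is one sentence, and a token is a run of
    word characters or hyphens (so 'Master-Smith' stays whole).
    """
    sentences = []
    for line in text.splitlines():
        tokens = re.findall(r"[\w-]+", line)
        if tokens:  # skip blank lines
            sentences.append(tokens)
    return sentences

text = (
    "Words about John.\n"
    " John Master-Smith is a coder.\n"
    "Aleksandra likes goldfish.\n"
    "She likes to dance.\n"
)
sentences = tokenize(text)
```

A real NLP tokenizer would treat punctuation and sentence boundaries differently, so the regex here is only a placeholder.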

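What I want to do is map these tokens to an array that indicates, for each token, whether it is inside a tag.

That is, I would like to get back: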

[[None, None, None],
 ['NameHash:001', 'NameHash:001', None, None, None],
 ['NameHash:002', None, None],
 [None, None, None, None]]

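Does anyone know how to do this?

I have tried the code below, but it is not what I need: I want to know whether a piece of text was actually inside a tag in the original XML, not merely whether a token happens to occur somewhere in a string taken from the tagged text.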
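
One possible way to get this kind of output is to record tag membership while parsing, instead of extracting the plain text first. The following is a sketch under several assumptions, not a definitive implementation: it uses the standard library's `html.parser.HTMLParser` (the parser that BeautifulSoup's `'html.parser'` option builds on), treats each line as one sentence, and tokenizes with a simple regex. The names `TagMaskParser`, `tag_mask`, and `TOKEN_RE` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed tokenizer: runs of word characters or hyphens.
TOKEN_RE = re.compile(r"[\w-]+")

class TagMaskParser(HTMLParser):
    """Collect tokens per line, with the Key of the enclosing <person>."""

    def __init__(self):
        super().__init__()
        self.current_key = None  # Key of the <person> we are inside, if any
        self.tokens = [[]]       # one token list per line of text
        self.labels = [[]]       # parallel lists of labels

    def handle_starttag(self, tag, attrs):
        if tag == "person":
            # html.parser lowercases attribute names: "Key" arrives as "key".
            self.current_key = dict(attrs).get("key")

    def handle_endtag(self, tag):
        if tag == "person":
            self.current_key = None

    def handle_data(self, data):
        # Assumption: a newline in the text starts a new sentence.
        for i, line in enumerate(data.split("\n")):
            if i > 0:
                self.tokens.append([])
                self.labels.append([])
            for tok in TOKEN_RE.findall(line):
                self.tokens[-1].append(tok)
                self.labels[-1].append(self.current_key)

def tag_mask(xml_text):
    """Return (tokens, labels) with empty lines dropped."""
    parser = TagMaskParser()
    parser.feed(xml_text)
    parser.close()
    pairs = [(t, l) for t, l in zip(parser.tokens, parser.labels) if t]
    return [t for t, _ in pairs], [l for _, l in pairs]

sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    'Words about John.  \n'
    ' <person Key="NameHash:001">John Master-Smith</person> is a coder.\n'
    '<person Key="NameHash:002">Aleksandra</person> likes goldfish.\n'
    'She likes to dance.\n'
)
tokens, labels = tag_mask(sample)
```

Because the label is taken from the parser state at the moment the text is seen, the "John" in the first sentence stays unlabeled even though the same word also occurs inside a `person` element later on. Nested `person` elements are not handled by this sketch.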

from bs4 import BeautifulSoup

with open('test.txt.xml') as f:
    soup = BeautifulSoup(f, 'html.parser')
    text = soup.get_text()
    # text content of every <person> element
    tag_text = [x.string for x in soup.find_all('person')]
    # split into word tokens with specific tokenizer
    sentences = tokenize(text)
    tag_mask_all_sentences = []
    for sentence in sentences:
        print(sentence)
        sentence_mask = []
        for word in sentence:
            found = False
            for s in tag_text:
                # substring test against each tagged string
                if word in s:
                    sentence_mask.append(1)
                    found = True
            if not found:
                sentence_mask.append(None)
        tag_mask_all_sentences.append(sentence_mask)

    for tag_mask in tag_mask_all_sentences:
        print(tag_mask)

This returns:

['Words', 'about', 'John']
['John', 'Master-Smith', 'is', 'a', 'coder']
['Aleksandra', 'likes', 'goldfish']
['She', 'likes', 'to', 'dance']

[None, None, 1]
[1, 1, None, 1, 1, None]
[1, None, None]
[None, None, None, None]

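You can see that this is not right, since the 'John' in the first sentence is not inside a tag. I am also not sure what is happening with 'is' and 'a'... I think it is because those characters are found inside the tag text, which is obviously very wrong.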
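
The extra entries can be traced to the `in` operator, which on strings is a substring test. The small sketch below uses a hypothetical helper, `substring_hits`, to list which tagged strings a word matches. It suggests that 'a' matches both tagged strings (so 1 is appended twice, which is why the second row has six entries for five tokens), while 'is' matches neither.

```python
tag_text = ["John Master-Smith", "Aleksandra"]

def substring_hits(word, tag_text):
    """Return the tagged strings that contain `word` as a substring."""
    return [s for s in tag_text if word in s]

hits_a = substring_hits("a", tag_text)        # 'a' occurs in 'Master' and in 'Aleksandra'
hits_is = substring_hits("is", tag_text)      # no tagged string contains 'is'
hits_john = substring_hits("John", tag_text)  # matches regardless of where 'John' appeared
```

The check also ignores position entirely, so the untagged 'John' in the first sentence is marked just because the same word occurs in a tagged name elsewhere.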

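Ultimately, a simpler way to think about the output I need is something like this: w is a word, 0 is used for padding, . for non-tagged tokens, and { {1}} for tagged tokens.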
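
A padded matrix of this kind could be built from the per-sentence label lists roughly as follows. This is a sketch only: `to_padded_matrix` is a hypothetical name, and because the symbol for tagged tokens did not survive in the text above, it is left as a required argument. The "T" passed in the usage lines is purely illustrative.

```python
def to_padded_matrix(labels, tagged, untagged=".", pad="0"):
    """Turn ragged label rows into equal-width rows of symbols.

    labels: list of rows; each entry is None (not in a tag) or a tag key.
    tagged: symbol for tagged tokens (caller-supplied; not given in the question).
    """
    width = max((len(row) for row in labels), default=0)
    matrix = []
    for row in labels:
        cells = [untagged if label is None else tagged for label in row]
        cells += [pad] * (width - len(row))  # right-pad short rows
        matrix.append(cells)
    return matrix

labels = [
    [None, None, None],
    ["NameHash:001", "NameHash:001", None, None, None],
    ["NameHash:002", None, None],
    [None, None, None, None],
]
matrix = to_padded_matrix(labels, tagged="T")  # "T" is an illustrative placeholder
```

Padding on the right keeps token positions aligned with the original sentences; padding on the left would work equally well if the downstream model expects it.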

0 answers:

No answers