Excluding tags (<pattern>) inside a tag (<topic>) from the result set using BeautifulSoup

Time: 2018-08-09 01:06:04

Tags: python beautifulsoup tags aiml

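I'm just getting started with web scraping in Python, and I'm currently using BeautifulSoup for data extraction. I have this .aiml file (XML), from which I want to extract the data of every pattern tag that is NOT INCLUDED inside a topic tag.

I have already obtained all the pattern values, but the challenge is that patterns which have a topic tag as a parent should not be included in the result set.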

Here is the AIML file:

<?xml version = "1.0" encoding = "UTF-8"?>

<aiml version="1.0.1" encoding="UTF-8">
  <topic name="botdog">
   <category>
      <pattern>MY DOG'S NAME IS *</pattern>
      <template>
         That is interesting that you have a dog named <set name="dog"><star/></set>
      </template>  
   </category>

   <category>
      <pattern>WHAT IS MY DOG'S NAME</pattern>
      <template>
         Your dog's name is <get name="dog"/>.
      </template>  
   </category>  
  </topic>

  <topic name="botcat">
   <category>
      <pattern>MY CAT'S NAME IS *</pattern>
      <template>
         That is interesting that you have a cat named <set name="cat"><star/></set>
      </template>  
   </category>

   <category>
      <pattern>WHAT IS MY CAT'S NAME</pattern>
      <template>
         Your cat's name is <get name="cat"/>.
      </template>  
   </category>  
  </topic>


  <category>
      <pattern>HELLO ALICE</pattern>
      <template>
         Hello User
      </template>
   </category>

   <category>
      <pattern>HOW ARE YOU</pattern>
      <template>
         I'm fine
      </template>
   </category>
</aiml>

Python code (Flask):

@extract.route('/')
def index_page():
    folder = 'templates/topic.aiml'
    with open(folder, 'r') as myfile:
        soup = BeautifulSoup(myfile.read(), 'html.parser')
    data_topic = [match.pattern.text for match in soup.find_all('category')]

    print(data_topic)


    # data = " ".join(data_set)

    return jsonify({'data_set': data_topic})

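The value printed by print() is:

["MY DOG'S NAME IS *", "WHAT IS MY DOG'S NAME", "MY CAT'S NAME IS *", "WHAT IS MY CAT'S NAME", 'HELLO ALICE', 'HOW ARE YOU']

It should contain only the patterns that have no topic parent tag: ['HELLO ALICE', 'HOW ARE YOU']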

1 Answer:

Answer 0: (score: 0)

Try this:

@extract.route('/')
def index_page():
    folder = 'templates/topic.aiml'

    with open(folder, 'r') as myfile:
        soup = BeautifulSoup(myfile.read(), 'html.parser')

    # Collect patterns only from <category> elements whose direct parent is not <topic>.
    data = []
    for cat in soup.find_all('category'):
        if cat.parent.name == 'topic':
            continue
        data.append(cat.find('pattern').text)

    print(data)
    return jsonify({'data_set': data})

Hope this helps! Check out the docs for more examples.
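For trying the filter outside of Flask, here is a minimal, self-contained sketch. It assumes a shortened copy of the AIML from the question held in a string (the name AIML and the helper patterns_outside_topics are made up for this illustration), and it swaps the direct-parent check for find_parent('topic'), which also skips patterns nested more than one level below a topic.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the question's file (assumption: inline string
# instead of reading templates/topic.aiml from disk).
AIML = """\
<aiml version="1.0.1" encoding="UTF-8">
  <topic name="botdog">
    <category>
      <pattern>MY DOG'S NAME IS *</pattern>
      <template>That is interesting</template>
    </category>
  </topic>
  <category>
    <pattern>HELLO ALICE</pattern>
    <template>Hello User</template>
  </category>
  <category>
    <pattern>HOW ARE YOU</pattern>
    <template>I'm fine</template>
  </category>
</aiml>
"""


def patterns_outside_topics(markup):
    """Return the text of every <pattern> that has no <topic> ancestor."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        pattern.get_text(strip=True)
        for pattern in soup.find_all("pattern")
        # find_parent() returns None when no enclosing <topic> exists.
        if pattern.find_parent("topic") is None
    ]


print(patterns_outside_topics(AIML))
```

With this sample, only the two top-level patterns (HELLO ALICE and HOW ARE YOU) should be kept. Inside the Flask view, the same list could be passed to jsonify in place of data.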