How can I retrieve the text inside a label tag with lxml?

Time: 2018-10-11 08:13:00

Tags: parsing web-scraping lxml lxml.html

I am using lxml to get the text inside tags, like this:

xpaths_for_questions_lxml = []
for tag in self.tree.iter():
    try:
        # tag.text holds only the text that precedes the tag's first child element
        if tag.text and utils.is_question(tag.text.strip()):
            xpaths_for_questions_lxml.append(self.tree.getpath(tag))

    except Exception:
        self.logger.debug(traceback.format_exc())
        raise
The is_question function returns True if the sentence contains a question mark.

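However, when the tag is a label, the tag.text attribute is empty, even though the label tag on the actual webpage has text inside it.

Why does the label tag show no text content? Or do I need to do something else to get the label's text?

EDIT1: My question is: I am iterating over every child in the DOM tree, so why does the text inside the label not show up?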

1 Answer:

Answer 0 (score: 1)

If you want to get the questions, you can try:

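import requests
from lxml import html

r = requests.get('https://www.amctheatres.com/faqs/movie-info')
source = html.fromstring(r.text)
# text() returns each text node directly inside a label as a separate string
questions = source.xpath('//label[@itemprop="text"]/text()')

# text_content() returns all the text inside each label as a single string
questions = [label.text_content() for label in source.xpath('//label[@itemprop="text"]')]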

Note that since the label node contains multiple child text nodes, you should use label.text_content() rather than label.text.

print(questions)
#['Does the runtime shown for each movie include trailers?', 'Where can I find MPAA movie ratings information?', 'What does advertised showtime mean?', 'What movies are playing right now at AMC?', 'What movies are coming soon to AMC?', 'How can I find movie times at AMC?']
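To illustrate why tag.text can come back empty, here is a minimal sketch on made-up markup (the real page's HTML is not reproduced here, so the span inside the label is an assumption). In lxml, .text only holds the text that comes before an element's first child; text that follows a child element is stored in that child's .tail. The is_question function below is a hypothetical stand-in for utils.is_question from the question, assumed to check only for a question mark.

```python
from lxml import html

# Hypothetical markup: the label's visible text comes after a child element,
# so lxml stores it in that child's .tail rather than in label.text.
doc = html.fromstring(
    '<div>'
    '<label itemprop="text"><span class="icon"></span>How can I find movie times?</label>'
    '<p>Check the showtimes page.</p>'
    '</div>'
)


def is_question(text):
    # Stand-in for utils.is_question from the question (assumed behaviour).
    return '?' in text


label = doc.xpath('//label')[0]
label_text = label.text            # None: no text precedes the <span>
span_tail = label[0].tail          # the visible text lives here
full_text = label.text_content()   # all text inside the label, joined

tree = doc.getroottree()
xpaths_for_questions = []
for tag in tree.iter():
    # Skip comments and processing instructions, whose .tag is not a string.
    if not isinstance(tag.tag, str):
        continue
    # Join only the text nodes that are direct children of this element,
    # so ancestors such as <div> are not matched through their descendants.
    own_text = ''.join(tag.xpath('text()')).strip()
    if own_text and is_question(own_text):
        xpaths_for_questions.append(tree.getpath(tag))

print(label_text, repr(span_tail), repr(full_text))
print(xpaths_for_questions)
```

The loop joins the element's direct text nodes instead of calling text_content() on every tag, because text_content() also includes the text of descendants, which would make every ancestor of the label match as well.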