Question

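I want an XPath that selects a meta tag using two conditions. Normally it looks like this:

//div[@id='..' and @class='...']

However, the meta tag I want to extract looks like this: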

< meta name="Description" xml:lang="en" content="Some text which I want to extract.">

I tried:

extract_with_xpath('//meta[@name="Description" and @xml:lang="en"]/@content')

And also:

extract_with_xpath('//meta[@name="Description" and (@xml:lang="en")]/@content')

I also tried several other variants, but none of them worked.

Does anyone know how to solve this?

Answer 1

The < meta tag contains a space, so I was not able to extract data from it either. But you can try:

< meta

Answer 2

After looking at your website, the meta tag is actually:

<meta name="DC.Description" xml:lang="en" content="some text">

To extract the content, use the following XPath:

d_x = '//meta[@name="DC.Description"]'

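In addition, several meta tags match this same selector. xml:lang is the attribute that tells them apart, but the XPath and CSS selectors used here do not match an attribute whose name contains the colon delimiter. You have to do it like this: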

desc_metas = response.xpath(d_x).getall()    # serialized HTML of each description meta tag
filter_desc = []
for d in desc_metas:
    # Rename xml:lang to lang so that a plain attribute test can match it
    filter_desc.append(d.replace('xml:lang', 'lang'))

Now get the description for the language you want, e.g. 'en':

from scrapy.selector import Selector

en_desc = None
for d in filter_desc:
    sel = Selector(text=d)    # parse the string back into a Selector
    content = sel.xpath('//meta[@lang="en"]/@content').get()    # match on the renamed lang attribute
    if content is not None:
        en_desc = content
        break
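If the goal is only to pick the description for one language, the same filtering can also be sketched without XPath, using just Python's standard-library html.parser. This is not the approach from the answer above, only a minimal illustration of matching on both name and xml:lang at once. The MetaContentCollector class and the SAMPLE_HTML page are hypothetical names made up for this example, and the German tag is invented sample data.

```python
from html.parser import HTMLParser


class MetaContentCollector(HTMLParser):
    """Collect the content of <meta> tags matching a name and an xml:lang value."""

    def __init__(self, name, lang):
        super().__init__()
        self.name = name
        self.lang = lang
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names and keeps "xml:lang" as one plain name.
        attributes = dict(attrs)
        if attributes.get("name") == self.name and attributes.get("xml:lang") == self.lang:
            self.matches.append(attributes.get("content"))


# Hypothetical sample page, modelled on the tags quoted in this thread.
SAMPLE_HTML = """
<html><head>
<meta name="DC.Description" xml:lang="de" content="etwas Text">
<meta name="DC.Description" xml:lang="en" content="some text">
</head><body></body></html>
"""

collector = MetaContentCollector(name="DC.Description", lang="en")
collector.feed(SAMPLE_HTML)
en_desc = collector.matches[0] if collector.matches else None
print(en_desc)
```

Inside a Scrapy callback, response.text could be fed to the collector in place of SAMPLE_HTML.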

Extracting content from a meta tag with XPath using multiple conditions

2 answers: