Question

I am trying to scrape the content that appears after a specific keyword/string.

Suppose the HTML being queried with XPath is as follows:

   <meta property="og:url" content="https://www.example.com/tshirt/pcid111-31">
   <meta property="og:url" content="https://www.example.com/tshirt/pcid3131-33">
   <meta property="og:url" content="https://www.example.com/tshirt/pcid545424524-84">

1) How do I extract the content attribute of every element with property="og:url"?

2) I also want to extract whatever comes after pcid. Can anyone suggest a way to do this?

I am not sure whether this works:

item["example"] = sel.xpath("//meta[@property='og:url']/text()").extract()[0].replace("*PCID", "")

Does replace accept wildcards?

Answer 1

Try this:

# Run the XPath query once and keep the extracted content strings
contents = hxs.select("//meta/@content").extract()

for content in contents:
    # Print the part of each string that follows 'pcid'
    print(content.split('pcid')[1])

Output:

111-31

3131-33

545424524-84
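One caveat: `//meta/@content` matches every meta tag that has a content attribute, and `split('pcid')[1]` raises an IndexError for any value that does not contain `pcid`. Below is a minimal sketch of a guarded version. It uses plain strings as stand-ins for the extracted values, and the helper name `after_keyword` and the sample list are assumptions made for illustration only.

```python
def after_keyword(value, keyword="pcid"):
    """Return the text after the first occurrence of keyword, or None if it is absent."""
    _, found, tail = value.partition(keyword)
    return tail if found else None


# Stand-ins for the strings returned by .extract(); the second one has no 'pcid'.
contents = [
    "https://www.example.com/tshirt/pcid111-31",
    "A page description without the keyword",
    "https://www.example.com/tshirt/pcid545424524-84",
]

# Keep only the values where the keyword was actually found.
ids = [tail for tail in map(after_keyword, contents) if tail is not None]
print(ids)
```

Restricting the XPath to `//meta[@property='og:url']/@content` would probably avoid most of the non-matching values in the first place.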

Answer 2

This extracts the content attribute of every element whose property attribute is "og:url":

og_urls = response.xpath("//meta[@property='og:url']/@content").extract()

To pull a piece out of a URL it is usually best to use a regular expression. In your case that would be:

import re

for url in og_urls:
   # "pcid(.+)" captures every character after 'pcid' (greedy)
   id = re.findall(r"pcid(.+)", url)
   # re.findall() returns a list; you probably want only the first match, and there will most likely be only one anyway
   id = id[0] if id else ''
   print(id)

Alternatively, you can split the URL on 'pcid' and take the last piece, for example:

for url in og_urls:
   id = url.split('pcid')[-1]
   print(id)
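On the wildcard question: `str.replace` treats its first argument as literal text, so `"*PCID"` is not interpreted as a pattern. `re.sub` is the pattern-based counterpart. The following is a small sketch, assuming the ID is everything after `pcid` up to the end of the URL; the sample URLs are taken from the question.

```python
import re

og_urls = [
    "https://www.example.com/tshirt/pcid111-31",
    "https://www.example.com/tshirt/pcid3131-33",
    "https://www.example.com/tshirt/pcid545424524-84",
]

# str.replace looks for the literal characters "*pcid"; none are present,
# so the string comes back unchanged.
unchanged = og_urls[0].replace("*pcid", "")

# re.sub takes a regular expression: "^.*pcid" matches everything from the
# start of the string through 'pcid', which is replaced with nothing.
ids = [re.sub(r"^.*pcid", "", url) for url in og_urls]
print(ids)
```

If a URL does not contain `pcid`, `re.sub` returns it unchanged, so a check such as `re.search` may still be worth adding before trusting the result.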

Using Scrapy to scrape content after a specific keyword/string
