How do I write a regular expression for XPath in Scrapy?

Date: 2018-09-10 12:04:06

Tags: python-3.x web-scraping scrapy scrapy-spider

I am new to Scrapy and am using it to scrape the questions and answers from a web page. I started from this page.

By inspecting the XPath of the elements in the browser, I tried a selector like this:

sel = Selector(text=response.body)            # scrapy.selector.Selector built from the page source
spanList = sel.xpath('//a/span').extract()    # every <a>/<span> node as raw HTML

But when I do this I get some duplicated content; the output looks like this:

"<span>How do I access my account online at Citibank Online?</span>",
"<span>What are the guidelines for creating an internet password?</span>",
"<span>I forgot my User ID for accessing my account online. How do I access my account online now?</span>",
"<span>How do I transfer funds to another bank account in India?</span>",
"<span>How do I transfer funds to my Rupee Checking Account from overseas?</span>",
"<span>How do I transfer funds from my Rupee Checking Account to my local bank account overseas?</span>",
"<span>How do I update my contact information?</span>",
"<span>I have not operated my Rupee Checking Account for a long time and I plan to visit India. Can I transact on my account when I visit India?</span>",
"<span>My Term Deposits with Citibank are due to mature soon. What do I need to do?</span>",
"<span>I would like to terminate my Term Deposits before maturity? Will I lose any money?</span>",
"<span>Why do I need to provide \"Customer Profile Update\" forms so often?</span>",
"<span>How do I access my account online at Citibank Online?</span>",
"<span>What are the guidelines for creating an internet password?</span>",
"<span>I forgot my User ID for accessing my account online. How do I access my account online now?</span>",
..................

If you look at the part of the output I posted, the first and third spans are repeated again.

Is there any way to write a good regular expression to get the content without repetition?
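For reference, Scrapy selectors expose a .re() method, so a regular expression can be applied directly to the matched nodes. A minimal sketch, assuming the spans have no extra attributes as in the output above; this pattern only pulls out the inner text and does not remove duplicates by itself:

from scrapy.selector import Selector

sel = Selector(text=response.text)   # response.text is the decoded page source
# .re() applies the regex to each matched node's HTML and returns the
# captured groups as plain strings instead of the <span>...</span> markup
questions = sel.xpath('//a/span').re(r'<span>(.*?)</span>')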

Sample XPaths of the questions on the page I mentioned are:

/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[3]/div[1]/a/span

/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[5]/div[5]/div[1]/a/span

/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[5]/div[1]/div[1]/a/span

1 Answer:

Answer 0 (score: 1)

Take a look at this:

points = response.xpath('//*[@class="ClsInnerDrop"]//span/text()').extract()
points = set(points)   # set() drops the duplicated questions
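As a side note, here is a minimal sketch of how this could sit inside a spider's parse method; the spider name and start URL are placeholders, and dict.fromkeys is used instead of set() when the original page order of the questions should be kept:

import scrapy

class FaqSpider(scrapy.Spider):
    name = "faq"                               # placeholder spider name
    start_urls = ["https://example.com/faq"]   # placeholder for the FAQ page from the question

    def parse(self, response):
        # grab every question text, then drop duplicates;
        # dict.fromkeys keeps the first occurrence of each question in page order
        points = response.xpath('//*[@class="ClsInnerDrop"]//span/text()').extract()
        for question in dict.fromkeys(points):
            yield {"question": question}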