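I can't reliably locate some text on a page, mostly because the position of that text changes from page to page.
I'd appreciate some help extracting the text that comes right after the line containing the keyword "Camp Director".
Sample HTML: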
<div class="span4">
<strong>Camp Director : </strong>
<span>Camp Directors Name</span>
</div>
This is what I've been experimenting with:
import re
import html2text
from scrapy.selector import Selector
# BayItem comes from the project's own items module.

def parse1(self, response):
    hxs = Selector(response)
    titles = hxs.xpath('//*[@id="fullwidth-container"]')
    body = hxs.xpath('/html/body')
    items = []
    for title in titles:
        item = BayItem()
        # Attempt 1: a full CSS path to the span.
        item["director"] = "".join(response.css('#fullwidth-container > div > div > div.geobase.complex-module-container.module > div.geobase-listing > div > div.premium.row-fluid.complex-module-columns-container > div.span8.respond-container.main-block > div.custom-field.geobase-cf-text > div:nth-child(4) > div:nth-child(3) > span').extract())
        # Attempts 2-4: positional XPaths; the div indexes differ from page to page.
        item["director1"] = title.xpath('//*[@id="fullwidth-container"]/div/div/div[1]/div[1]/div/div[2]/div[1]/div[3]/div[3]/div[2]/span').extract()
        item["director2"] = title.xpath('//*[@id="fullwidth-container"]/div/div/div[1]/div[1]/div/div[2]/div[1]/div[4]/div[3]/div[2]/span').extract()
        item["director3"] = title.xpath('//*[@id="fullwidth-container"]/div/div/div[1]/div[1]/div/div[2]/div[1]/div[5]/div[4]/div[2]/span').extract()
        # Attempt 5: a regex over the page text (note that \* matches a literal asterisk).
        item["director4"] = re.findall(r'Camp Director(\*)', response.text)
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        items.append(item)
    return items
I suspect I may have to lean more on regular expressions, but I'm not sure how to use them here. Any help is much appreciated!
Answer 0: (score: 0)
As long as this format is consistent (meaning there is a newline after the "Camp Director" line), this should work for you:
regex = Camp\sDirector.+strong>\n\s*(.*)
This captures the text on the next line.
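As a rough illustration of how that pattern behaves, here is a minimal sketch that runs it against the sample HTML from the question using Python's `re` module. The variable names and the tag-stripping step are assumptions added for illustration and are not part of the original answer. Note that the capture group returns the whole next line, markup included, so the tags still need to be removed.

```python
import re

# Sample markup copied from the question.
html = """<div class="span4">
<strong>Camp Director : </strong>
<span>Camp Directors Name</span>
</div>"""

# The pattern from the answer: match the label line, then capture the next line.
pattern = re.compile(r"Camp\sDirector.+strong>\n\s*(.*)")

match = pattern.search(html)
next_line = match.group(1) if match else None

# Illustrative extra step (an assumption): drop the surrounding tags.
name = re.sub(r"<[^>]+>", "", next_line).strip() if next_line else None

print(next_line)
print(name)
```

This only works while the value really sits on the line directly after the label; if the markup is minified or reordered, the pattern will not match.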
Answer 1: (score: 0)
Thanks for the help! I found the answer I needed. With help from @AmericanMade and @Dimitre Novatchev in "Extract text based on previous and next sibling",
the final code is:
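import html2text
from scrapy.selector import Selector
# BayItem comes from the project's own items module.

def parse1(self, response):
    hxs = Selector(response)
    titles = hxs.xpath('//*[@id="fullwidth-container"]')
    items = []
    for title in titles:
        item = BayItem()
        # Anchor on the label text instead of a position, then take the text nodes that follow it.
        item["director"] = response.xpath('//div[contains(text(), "Camp Director : ")]/following-sibling::text()').extract()
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        items.append(item)
    return items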
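The idea behind this approach is to anchor on the label text rather than on a position in the page, then read whatever follows the label. Below is a minimal standard-library sketch of that idea applied to the sample HTML from the question, where the label sits in a `<strong>` and the value in the following `<span>`. The class name `LabelValueParser` and its attributes are hypothetical and chosen only for illustration; this is not Scrapy's API. In Scrapy itself, an XPath along the lines of `//strong[contains(., "Camp Director")]/following-sibling::span[1]/text()` would probably express the same thing for that markup, though that is an assumption about the real page.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect the text of the <span> that follows a <strong> label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.current_tag = None   # tag whose text is currently being read
        self.label_seen = False   # True once the label's <strong> was found
        self.values = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        if self.current_tag == "strong" and self.label in data:
            self.label_seen = True
        elif self.current_tag == "span" and self.label_seen:
            # First <span> text after the label is treated as its value.
            self.values.append(data.strip())
            self.label_seen = False


# Sample markup copied from the question.
html = """<div class="span4">
<strong>Camp Director : </strong>
<span>Camp Directors Name</span>
</div>"""

parser = LabelValueParser("Camp Director")
parser.feed(html)
parser.close()
directors = parser.values
print(directors)
```

Because the lookup depends only on the label and the element after it, it keeps working when the block moves to a different place in the page.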