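I am new to Scrapy, HTML and Java. I am trying to get a list of all the branches and agents for our agency from the website. Most of the information I need can be pulled from the AJAX result: www.tysonprop.co.za/ajax/agents/?branch_id=[id]

The challenge is twofold:

1. The branch names shown on the website (https://www.tysonprop.co.za/agents/) are contained in span elements that are not visible when viewing the page source, which means Scrapy cannot find them. For example, "Tyson Properties Fourways Office" should in theory be located at: xpath('//div[@id="select2-result-label-76"]/span[@class="select2-match"]/text()') [![See inspect element][1]][1]
2. The AJAX call requires a branch ID. I cannot work out how the page converts the branch name selected in the dropdown into a branch ID, so I cannot intercept that logic. In other words, how do I extract a list of branch names with their corresponding IDs?

I have searched the web extensively with little success. Any help would be appreciated.

  [1]: https://i.stack.imgur.com/1kjk8.png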
import json

import scrapy

# Agent is the project's scrapy.Item subclass; its definition and import are not shown here.


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        branch_id = json_data['branch']['id']
        branch_name = json_data['branch']['branch_name']
        branch_tel = json_data['branch']['get_dialable_telephone_number']

        # Loop through all of the agents
        for agent_data in json_data['agents']:
            agent = Agent()  # fresh item per agent, so earlier items are not overwritten
            agent['id'] = agent_data['id']
            agent['branch_id'] = branch_id
            agent['branch_name'] = branch_name
            agent['branch_tel'] = branch_tel
            agent['privy_seal_url'] = agent_data['privy_seal_url']
            yield agent
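The field mapping in `parse` can be checked offline, without requesting the site. The sketch below is a plain-Python version of that mapping, assuming the payload has the `branch` and `agents` keys the spider reads. The function name `flatten_agents` and every value in `SAMPLE` are made up for illustration; they are not real data from the endpoint.

```python
import json


def flatten_agents(payload):
    """Return one dict per agent, each carrying the shared branch fields."""
    branch = payload["branch"]
    rows = []
    for agent in payload["agents"]:
        rows.append({
            "id": agent["id"],
            "branch_id": branch["id"],
            "branch_name": branch["branch_name"],
            "branch_tel": branch["get_dialable_telephone_number"],
            "privy_seal_url": agent["privy_seal_url"],
        })
    return rows


# Hypothetical payload using the same keys the spider reads; all values are invented.
SAMPLE = json.loads("""
{
  "branch": {
    "id": 25,
    "branch_name": "Example Branch",
    "get_dialable_telephone_number": "+27000000000"
  },
  "agents": [
    {"id": 1, "privy_seal_url": "https://example.com/seal/1"},
    {"id": 2, "privy_seal_url": "https://example.com/seal/2"}
  ]
}
""")

if __name__ == "__main__":
    for row in flatten_agents(SAMPLE):
        print(row)
```

Building a new dict per agent avoids the problem of reusing a single item object across loop iterations.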
Related question: Scrapy xpath not extracting div containing special characters <%=
Answer 0 (score: 0)
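If you look at the page source, you can see that the branch IDs and names are present in the HTML, under the element with name="agent_search". With the logic below, you can loop over the branches and get each one's ID and name: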
branches_xpath = response.xpath('//*[@name="agent_search"]//option')
for branch_xpath in branches_xpath[1:]:  # skip the first option, which is empty
    branch_id = branch_xpath.xpath('./@value').get()
    branch_name = branch_xpath.xpath('./text()').get()
    print(f"branch_id: {branch_id}, branch_name: {branch_name}")
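To connect this to the AJAX endpoint from the question, each extracted branch ID can be substituted into the `branch_id` query parameter. The following is a minimal standard-library sketch of that step, not Scrapy code. It assumes the element named `agent_search` is a `<select>`, and it skips options with an empty `value` instead of dropping the first option by position. The class `BranchOptionParser`, the helper functions, and the branch names in `SAMPLE_HTML` are all hypothetical.

```python
from html.parser import HTMLParser

# Endpoint taken from the question; the branch ID fills the query parameter.
AJAX_URL = "https://www.tysonprop.co.za/ajax/agents/?branch_id={branch_id}"


class BranchOptionParser(HTMLParser):
    """Collect (branch_id, branch_name) from <option> tags in the agent_search select."""

    def __init__(self):
        super().__init__()
        self.branches = []
        self._in_select = False
        self._current_id = None  # None means "not inside an option"
        self._text = []

    def _flush(self):
        # Store the pending option; options with an empty value are placeholders.
        if self._current_id is not None:
            if self._current_id:
                self.branches.append((self._current_id, "".join(self._text).strip()))
            self._current_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("name") == "agent_search":
            self._in_select = True
        elif tag == "option" and self._in_select:
            self._flush()  # tolerate options without a closing tag
            self._current_id = attrs.get("value") or ""
            self._text = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()
        if tag == "select":
            self._in_select = False


def extract_branches(html):
    """Return a list of (branch_id, branch_name) tuples found in the HTML."""
    parser = BranchOptionParser()
    parser.feed(html)
    parser.close()
    return parser.branches


def branch_urls(html):
    """Map each branch name to the AJAX URL for its ID."""
    return {name: AJAX_URL.format(branch_id=bid) for bid, name in extract_branches(html)}


# Hypothetical markup; the real page's option values and names will differ.
SAMPLE_HTML = """
<select name="agent_search">
  <option value=""></option>
  <option value="25">Example Branch A</option>
  <option value="76">Example Branch B</option>
</select>
"""

if __name__ == "__main__":
    for name, url in branch_urls(SAMPLE_HTML).items():
        print(name, url)
```

Inside a Scrapy spider, the same idea would be to yield a `scrapy.Request` for each generated URL from the callback that parses the agents page, with the JSON-handling method as the callback.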