I am still a beginner learning Scrapy.
I am using a Scrapy script to scrape a large number of links on rumah123.com, specifically at https://www.rumah123.com/en/sale/surabaya/surabaya-kota/all-residential/, and it works: it produces a CSV of links.
But when I change the link to https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/, my Scrapy script produces nothing.
When I run the script, the Scrapy log says exactly:
2019-10-18 13:02:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/> (referer: None)
2019-10-18 13:02:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=2> (referer: None)
2019-10-18 13:02:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=6> (referer: None)
2019-10-18 13:02:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=5> (referer: None)
2019-10-18 13:02:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=7> (referer: None)
2019-10-18 13:02:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=4> (referer: None)
2019-10-18 13:02:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=8> (referer: None)
2019-10-18 13:02:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=9> (referer: None)
2019-10-18 13:02:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=10> (referer: None)
2019-10-18 13:02:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=3> (referer: None)
2019-10-18 13:02:16 [scrapy.core.engine] INFO: Closing spider (finished)
But when I check the actual CSV, there is nothing in it!
Here is the full code of the script:
import scrapy
import pandas as pd


class Rumah123_Spyder(scrapy.Spider):
    name = "Home_Rent"
    url_list = []
    page = 1

    def start_requests(self):
        headers = {
            'accept-encoding': 'gzip, deflate, sdch, br',
            'accept-language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'cache-control': 'max-age=0',
        }
        # base = 'https://www.rumah123.com/en/sale/surabaya/surabaya-kota/all-residential/'
        base = 'https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/'
        for x in range(10):  # depends on the number of pages in the search results
            if x == 0:
                yield scrapy.Request(url=base, headers=headers, callback=self.parse)
                self.page += 1
            else:
                yield scrapy.Request(url=base + "?page=" + str(self.page), headers=headers, callback=self.parse)
                self.page += 1

        # Filter out URLs that are not listing links
        self.url_list = [rum for rum in self.url_list if "/property/" in rum]
        for x in range(len(self.url_list)):
            self.url_list[x] = "rumah123.com" + self.url_list[x]

        url_df = pd.DataFrame(self.url_list, columns=["Sub URL"])
        # url_df.to_csv("home_sale_link.csv", encoding="utf_8_sig")
        url_df.to_csv("home_rent_link.csv", encoding="utf_8_sig")

    def parse(self, response):
        for rumah in response.xpath('//a/@href'):
            if rumah.get() not in self.url_list:
                self.url_list.append(rumah.get())


from scrapy import cmdline
cmdline.execute("scrapy runspider Rumah123_url.py".split())
The expected result should look like the first attempt with the sale URL; below is a screenshot of the links:
The current result for the rent URL is empty; here is the screenshot:
Extra note: I have tested this with the Scrapy shell on https://www.rumah123.com/en/sale/surabaya/surabaya-kota/all-residential/, and when I run the code manually there it does produce the CSV, but running the code one line at a time is tiring :((
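For context, the manual shell check is roughly along these lines (a sketch; the "/property/" filter mirrors the one in the script):

$ scrapy shell "https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/"
>>> # keep only links that point to individual listing pages
>>> [u for u in response.xpath('//a/@href').getall() if "/property/" in u]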
Can someone point out why this happens? Thanks :)
Answer 0 (score: 1)
Extract the URLs in the spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://www.rumah123.com/en/rent/surabaya/surabaya-kota/all-residential/?page=' + str(i) for i in range(1, 10)]

    def parse(self, response):
        for quote in response.xpath('//*[@class="sc-bRbqnn iRnfmd"]'):
            yield {
                'url1': quote.xpath('a/@href').extract(),
            }
The simplest way to store the scraped data is with Feed exports, using the following command:
scrapy crawl quotes -o 1.csv
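If you would rather not pass -o on every run, the same export can also be configured on the spider itself through custom_settings; a minimal sketch using the FEED_FORMAT/FEED_URI settings, with the file name taken from the question:

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # equivalent to passing `-o home_rent_link.csv` on the command line
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'home_rent_link.csv',
    }
    # ... rest of the spider as above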
Answer 1 (score: 0)
The problem is that your spider does not yield anything.
You can try the following parse method:
def parse(self, response):
    for rumah in response.xpath('//a/@href'):
        if rumah.get() not in self.url_list:
            self.url_list.append(rumah.get())
    yield {'result': self.url_list}
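With items being yielded, the built-in feed export can then write them out, for example (reusing the runspider invocation and file name from the question):

scrapy runspider Rumah123_url.py -o home_rent_link.csv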
Answer 2 (score: 0)
I found the problem! By changing the loop range from 10 to a larger value (for example 30), my CSV is now filled with the list of URLs! Although I honestly don't know why this works.
for x in range(30):  # depends on the number of pages in the search results
    if x == 0:
        yield scrapy.Request(url=base, headers=headers, callback=self.parse)
        self.page += 1
    else:
        yield scrapy.Request(url=base + "?page=" + str(self.page), headers=headers, callback=self.parse)
        self.page += 1
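A likely explanation (not verified): start_requests is a generator that the engine consumes lazily, and the CSV-writing code after the loop runs as soon as the last request has been pulled from it, which with only 10 requests happens before any parse callback has had a chance to fill url_list. With 30 requests the generator is drained more slowly, so some responses are already parsed by the time the CSV is written. A more robust sketch is to write the CSV in the spider's closed() method, which Scrapy calls only after the crawl has finished (field names and file name as in the original spider):

# inside Rumah123_Spyder, replacing the CSV code at the end of start_requests
def closed(self, reason):
    # called once when the spider finishes, after all responses have been parsed
    links = [rum for rum in self.url_list if "/property/" in rum]
    links = ["rumah123.com" + rum for rum in links]
    url_df = pd.DataFrame(links, columns=["Sub URL"])
    url_df.to_csv("home_rent_link.csv", encoding="utf_8_sig")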