I'm trying to crawl the values of a dynamic table using scrapy-splash and export them to JSON/Excel/some similar format.
To load the values I have to click several buttons, but I can't find a way to do it; I have to admit I know little about crawling.
The HTML of the buttons looks like this:
<ul>
<li>
<a href="#">1</a>
</li>
<li>
<a href="#">2</a>
</li>
<li>
<a href="#">3</a>
</li>
<li>
<a href="#">4</a>
</li>
<li>
<span>...</span>
</li>
<li>
<a href="#">10</a>
</li>
</ul>
Whenever you click one of them, the contents of the table change, and so do the numbers above.
I want to click them all, one by one, extract the values of the table and save them to Excel/JSON.
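For the saving step alone, the standard-library csv module is enough once the rows have been extracted; a minimal sketch (the function and argument names are illustrative, not part of any crawling library):

```python
import csv

def write_table(path, header, rows):
    # write one page's table to a CSV file; rows is a list of lists of strings
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Opening with mode="a" (and skipping the header after the first page) would let every pagination page append to the same file.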
A small idea of it would be:
import scrapy
from scrapy_splash import SplashRequest

class Extractor(scrapy.Spider):
    name = 'extractor_spider'

    def start_requests(self):
        yield SplashRequest(  # render the page with Splash
            url='url',
            callback=self.parse,
        )

    def parse(self, response):
        selectors = response.xpath('//ul/li')  # the pagination <li> elements
        for sel in selectors:  # just a test, doesn't store all the selectors
            # sel.click()  # obviously doesn't work, but the idea is to click here and load each page's values
            ### CODE TO EXTRACT DATA AND WRITE TO CSV ###
            pass
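Since a Scrapy selector can't be clicked (the response is just static HTML by the time the spider sees it), one workaround in scrapy-splash is the `execute` endpoint with a small Lua script that performs the click inside the rendered page. A sketch, where the CSS selector and the wait times are assumptions that would need adjusting to the real page:

```python
# Splash Lua script: load the page, click the Nth pagination link,
# wait for the table to re-render, and return the resulting HTML.
CLICK_PAGE_LUA = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1.0))
    -- click the Nth pagination link (0-based index into 'ul li a')
    assert(splash:runjs(string.format(
        "document.querySelectorAll('ul li a')[%d].click()", args.page_index)))
    assert(splash:wait(1.0))
    return splash:html()
end
"""

def splash_click_kwargs(url, page_index):
    # keyword arguments for a SplashRequest that runs the script above
    return {
        "url": url,
        "endpoint": "execute",
        "args": {"lua_source": CLICK_PAGE_LUA, "page_index": page_index},
    }
```

In the spider this would be used as `yield SplashRequest(callback=self.parse, **splash_click_kwargs(url, i))`, looping `i` over however many links `response.xpath('//ul/li/a')` returns on each page.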
I've also tried reloading the page: using the inspect tool in Chrome, I saw in the Network tab that clicking a button sends a GET request. I've tried to emulate that request, but with no success:
def parse(self, response):
    ### CODE TO EXTRACT DATA AND WRITE TO CSV ###
    link = "https://url.com" + codeRequest
    yield SplashRequest(link, self.parse)
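If the Network tab really shows a plain GET for each page, Splash may not even be needed for pagination: the request can be replayed with an ordinary `scrapy.Request`. A sketch of building the URL, where the endpoint path and the `page` parameter name are placeholders to be copied from the actual request in DevTools:

```python
from urllib.parse import urljoin, urlencode

def page_url(base_url, endpoint_path, page):
    # endpoint_path and the "page" parameter name are placeholders;
    # copy the real ones from the request shown in Chrome's Network tab
    return urljoin(base_url, endpoint_path) + "?" + urlencode({"page": page})
```

The spider would then `yield scrapy.Request(page_url(...), callback=self.parse)` for each page number.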
Any tip or idea on how to get this done?
I've also thought about using Selenium, but there's no easy way to locate the buttons because they don't have any distinctive id or name.
Also, I believe it would be harder to crawl the pages with Selenium because the number of pages in the table(s) is unknown.
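For what it's worth, even without an id or name the pagination links can be located purely by structure, since in the snippet above they are the only <a> elements under that <ul>, and the page count doesn't need to be known up front: you can just count how many links match after each load. A sketch of the XPath (assuming the markup matches the snippet):

```python
# XPath matching the numbered pager links by structure alone (no id/name)
PAGER_LINKS = "//ul/li/a"

def nth_pager_link(n):
    # 1-based position among the matched links (hypothetical helper)
    return f"({PAGER_LINKS})[{n}]"
```

With Selenium this would be `driver.find_elements(By.XPATH, PAGER_LINKS)`, clicking each link in turn and re-querying the list after every click, since the DOM (including the numbers) re-renders.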
To be more specific, the site I want to crawl is Malwr, the "Behavioral Analysis" section and the table(s) below it.
Like the one at this link (a random example, not one of the ones I want to crawl): https://malwr.com/analysis/MmFlMTBkOTA1MGVjNGI5ZGE1M2E3YjQwYzAxYTNjZjc/