I am currently scraping a website that provides tables of data. The structure of each table is as follows:
<table>
  <tr> <!-- this is the first row -->
    <td> data 1 </td>
    ...
  </tr>
  ...
</table>
Let's say each table ends up with 20 rows and 10 columns. My script has to go from one table to the next, and there are between 100 and 1000 tables.
So, with XPath I locate each row, insert its data into two database tables, and move on to the next one. In pseudocode:
for table in tables:  # between 100 and 1000 tables
    for row in table:
        # get the text of each td in the row and build a list from it
        # insert half of the data into table 1 and get the id of the inserted row
        # insert the other half into table 2, with that id, to link both rows
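A fuller (but simplified) version of what I run is below; the table and column names are made up, sqlite3 stands in for my actual database, the URL is a placeholder, and the 7/8 split of the 15 cells is just for illustration:

import sqlite3
from selenium import webdriver

NUM_COLS = 15  # tds per row

def scrape_row(row):
    # Get the text of every td in the row (this is the slow part)
    return [row.find_element_by_xpath('.//td[%d]' % i).text
            for i in range(1, NUM_COLS + 1)]

driver = webdriver.Firefox()
driver.get('http://example.com/tables')  # placeholder URL

conn = sqlite3.connect('scrape.db')
cur = conn.cursor()

for table in driver.find_elements_by_xpath('//table'):  # 100 to 1000 tables
    for row in table.find_elements_by_xpath('.//tr'):
        row_data = scrape_row(row)
        # First half goes into table 1; keep the id of the inserted row
        cur.execute('INSERT INTO table1 (a, b, c, d, e, f, g) '
                    'VALUES (?, ?, ?, ?, ?, ?, ?)', row_data[:7])
        table1_id = cur.lastrowid
        # Second half goes into table 2, linked to table 1 via that id
        cur.execute('INSERT INTO table2 (table1_id, h, i, j, k, l, m, n, o) '
                    'VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
                    [table1_id] + row_data[7:])

conn.commit()
driver.quit()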
I've been timing it to see why and where it takes so long, and I got the following:
Overall time for one table: 16 seconds
Getting the data and building the list for one row: 0.453 s
Inserting the data into table 1: 0.006 s
Inserting the data into table 2: 0.0067 s
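For reference, I get these numbers by wrapping each step, roughly like this (insert_into_table1 and insert_into_table2 stand in for the actual INSERT statements):

import time

def timed(label, func, *args):
    # Run func(*args), print the elapsed time, and pass the result through
    start = time.perf_counter()
    result = func(*args)
    print('%s: %.4f s' % (label, time.perf_counter() - start))
    return result

# Inside the row loop:
#   row_data  = timed('scrape row',     scrape_row, row)
#   table1_id = timed('insert table 1', insert_into_table1, row_data[:7])
#   timed('insert table 2', insert_into_table2, table1_id, row_data[7:])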
This means that scraping all 1000 tables would take roughly four and a half hours (16 s × 1000 tables), which is way too long, considering that when I used BeautifulSoup the overall time was between half an hour and an hour and a half.
Since the problem is in getting the text from each td tag in each row, is there any way to speed that up? Essentially, what I am doing in that part of the script is:
data_in_position_1 = row.find_element_by_xpath('.//td[1]').text
...
data_in_position_15 = row.find_element_by_xpath('.//td[15]').text
row_data = [data_in_position_1, ..., data_in_position_15]
return row_data
Well, I don't know whether scraping the whole table at once, or some different approach, would give different results, but I need some way to speed this up.
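For what it's worth, the "scraping the whole table at once" idea I have in mind would pull the table's HTML out of the browser in a single WebDriver call and extract the cell texts locally; a sketch, assuming lxml is available (I haven't verified that this is actually faster):

from lxml import html

def scrape_table(table_element):
    # One round trip to the browser for the whole table...
    source = table_element.get_attribute('outerHTML')
    # ...then every td is read locally, with no further WebDriver calls
    tree = html.fromstring(source)
    return [[td.text_content().strip() for td in tr.xpath('./td')]
            for tr in tree.xpath('.//tr')]

My reasoning is that each .text call is a separate round trip to the browser, which would explain the 0.453 s per row, but I don't know if this is the right way to go.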
Thanks