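I'm using Python Scrapy to scrape some data from a website. The site has many tables. For example, it covers 50 states, each state has 3 to 5 tables, and I only scrape the third table.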
table_3 = response.xpath(
'//*[@id="all"]/div[3]/div/div/div[2]/div/div/div/table[3]').extract()
Table 3 has between 3 and 10 rows.
rows = [item for idx, item in enumerate(
table_3) if idx in indices]
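The indices list is used to check whether table 3 exists; if it doesn't, nothing is appended to rows.
To get the <td> values, I strip all the unwanted data from the rows list.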
td = []
for each in rows:
    # Strip the markup and unwanted text, leaving only the cell values
    temp = (each.replace('<table class="unwanted date">', '')
            .replace('<tr>', '').replace('</tr>', '')
            .replace('<td>', '').replace('</td>', '')
            .replace('unwanted date', '')
            .replace('\n', '').replace(' ', '')
            .replace('</table>', ''))
    td.append(temp.split('%'))
for each in td:
    print('The td are', each)
This doesn't give me output in the right format, and the approach isn't efficient.
Output: The td are ['$0+4', '$11,230+5.84', '$22,470+6.27', '$247,350+7.65', '']
Expected Output: The td are ['$0+', '$11,230+', '$22,470+', '$247,350+']['4', '5.84', '6.27', '7.65']
How can I achieve this?
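One possible way to get from the current output to the expected one, as a minimal sketch only: it assumes every non-empty item has the form amount+rate with a single '+', and that the empty string at the end comes from the trailing '%'. The helper name split_brackets is hypothetical and not part of Scrapy.

```python
def split_brackets(cells):
    """Split strings like '$11,230+5.84' into two parallel lists.

    Hypothetical helper: assumes each non-empty cell contains one '+'
    separating the amount from the rate.
    """
    amounts, rates = [], []
    for cell in cells:
        if not cell:
            # Skip the empty string left behind by the trailing '%'
            continue
        # partition splits at the first '+' and keeps the separator
        amount, plus, rate = cell.partition('+')
        amounts.append(amount + plus)  # keep the '+' on the amount side
        rates.append(rate)             # empty if the cell had no '+'
    return amounts, rates


cells = ['$0+4', '$11,230+5.84', '$22,470+6.27', '$247,350+7.65', '']
amounts, rates = split_brackets(cells)
print('The td are', amounts, rates)
```

In the loop from the question, this could be called on each list in td instead of printing it directly. Reading the cell text with an XPath such as .//td/text() on the table selector would probably avoid the chain of replace calls altogether, though that depends on how the page is structured.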