Selenium, Python web scraping

Time: 2020-10-19 14:31:06

Tags: python html selenium-webdriver web-scraping

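I am trying to extract data from an HTML table. Counting the rows works, but when I print them, the same row is repeated over and over. Can anyone tell me what is wrong with my code? Thanks.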

#counting length of row
rows = len(driver.find_elements_by_xpath('/html/body/form/fieldset/table[2]/tbody/tr/td[3]/table/tbody/tr[5]/td[2]/div/table[1]/tbody/tr[2]/td[1]/table[2]/tbody/tr'))
time.sleep(2)
print(rows)

for r in range(rows):
    value=driver.find_element_by_xpath('/html/body/form/fieldset/table[2]/tbody/tr/td[3]/table/tbody/tr[5]/td[2]/div/table[1]/tbody/tr[2]/td[1]/table[2]/tbody/tr["+str(r)+"]')
    print(value.text)


#Output:
18 #no of rows
Start of legal relation2/7/2018 #1st row
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
Start of legal relation2/7/2018
sample test case successfully completed

1 answer:

Answer 0 (score: 0)

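Without a URL it is hard to say what the cause is. However, XPath positions are 1-based, so the first tr element is [1], and I think your range call should be range(1, rows + 1). Your approach also seems very indirect, since your first query already appears to retrieve all the elements you are looking for. So why not just do the following?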

elements = driver.find_elements_by_xpath('/html/body/form/fieldset/table[2]/tbody/tr/td[3]/table/tbody/tr[5]/td[2]/div/table[1]/tbody/tr[2]/td[1]/table[2]/tbody/tr')
#time.sleep(2) # what does this accomplish?
print(len(elements))

text_list = [element.text for element in elements] # list of strings
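
A likely reason the same row keeps repeating, judging only from the posted code: the XPath in the loop is a single-quoted string, so the "+str(r)+" inside it is never evaluated as Python and is sent to the browser as literal text. In XPath, a predicate that is a non-empty string literal is true for every node, so every iteration matches all rows and find_element returns the first one. The sketch below needs no browser and only shows the string building. BASE is a hypothetical shortened stand-in for the long absolute path, and the two helper names are made up for illustration.

```python
# Hypothetical shortened stand-in for the long absolute XPath in the question.
BASE = "//table[2]/tbody/tr"


def broken_xpath(r):
    # Mirrors the question: the predicate sits inside one single-quoted
    # literal, so r is never used and the text "+str(r)+" stays as-is.
    return BASE + '["+str(r)+"]'


def row_xpath(r):
    # Build the index into the string; XPath positions start at 1.
    return f"{BASE}[{r}]"


rows = 3  # assumed row count, just for the demonstration

broken = [broken_xpath(r) for r in range(rows)]
fixed = [row_xpath(r) for r in range(1, rows + 1)]

print(broken)  # the same string on every iteration
print(fixed)   # one distinct XPath per row
```

If you do want to keep the loop, passing row_xpath(r) to find_element_by_xpath for r in range(1, rows + 1) should give one row per iteration, although collecting the elements in a single query as above is simpler.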