How to skip an <a> tag when scraping data with Selenium

Asked: 2021-02-20 14:03:44

Tags: python-3.x selenium selenium-webdriver

HTML:

<tbody>
       <tr >
           <td> Tim Cook </td>
           <td class="wpsTableNrmRow" > Apple CEO
               <a href="applicatiodetailaddress"> all CEOs </a> <!-- this node is not required -->
           </td>
       </tr>
       <tr >
           <td> Sundar Pichai </td>
           <td class="wpsTableNrmRow" > Google CEO </td>
       </tr>
       <tr >
           <td> NoCompany </td>
           <td class="wpsTableNrmRow" > NOT, DEFINED</td>
       </tr>
</tbody>

Code:

applicationData = [td.text for td in webBrowser.find_elements_by_xpath('//td[@class="wpsTableNrmRow"]')]
record = {'Designation1': applicationData[0],
 'Designation2': applicationData[1], 'Designation3': applicationData[2]}

Output:

 Designation: Apple CEO all CEOs  // 'all CEOs' is not wanted
 Designation: Google CEO
 Designation: NOT, DEFINED

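I am scraping data from a table and want the text of each cell without the text of the <a> tag nested inside it.

How can I do this?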

I tried:

[td.get_attribute("textContent").split("\n")[0] for td in webBrowser.find_elements_by_xpath('//td[@class="wpsTableNrmRow" and text()!=" "]')]

Output:

 Designation: Apple CEO  
 Designation: Google CEO
 Designation:           // should have value 'NOT, DEFINED'

How can I get that value?

1 answer:

Answer 0 (score: 1)

applicationData = [td.get_attribute("textContent").split("\n")[0] for td in webBrowser.find_elements_by_xpath('//td[@class="wpsTableNrmRow"]')]
record = {'Designation1': applicationData[0], 'Designation2': applicationData[1]}

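Try the code above. It reads the textContent attribute instead of .text. Because the source HTML has a line break between the cell's own text and the <a> tag, textContent returns them on separate lines, so you can split on "\n" and keep the first piece.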
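
The split on "\n" depends on where the line breaks fall in the page source: if a cell's text starts with a line break, split("\n")[0] gives an empty string. Below is a minimal sketch of two more forgiving variants, not a definitive implementation. The names first_nonblank_line, own_text and OWN_TEXT_JS are hypothetical helpers made up for this example. own_text assumes driver is a Selenium WebDriver and element is a WebElement found with it; the only Selenium call it uses is execute_script.

```python
# Sketch only: first_nonblank_line, own_text and OWN_TEXT_JS are
# hypothetical names introduced for illustration.


def first_nonblank_line(text_content):
    """Return the first non-blank line of text_content, stripped."""
    for line in text_content.split("\n"):
        stripped = line.strip()
        if stripped:
            return stripped
    return ""


# JavaScript that joins only the element's direct text nodes (nodeType 3),
# so text inside child elements such as <a> is left out.
OWN_TEXT_JS = """
return Array.from(arguments[0].childNodes)
    .filter(function (node) { return node.nodeType === 3; })
    .map(function (node) { return node.textContent; })
    .join(' ');
"""


def own_text(driver, element):
    """Return the element's own text, ignoring text of child elements.

    Assumes driver is a Selenium WebDriver and element is a WebElement
    located with that driver.
    """
    raw = driver.execute_script(OWN_TEXT_JS, element)
    # Collapse runs of whitespace (including line breaks) to single spaces.
    return " ".join(raw.split())
```

With the question's variables, the first helper would be used as first_nonblank_line(td.get_attribute("textContent")) and the second as own_text(webBrowser, td) for each td. The second variant does not rely on line breaks at all, at the cost of one JavaScript call per cell.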