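I have an HTML page that contains several sets of identical HTML markup with different data, and I need to extract the value "709". I can get all the text inside a tr tag, but I don't know how to go into the tr tag and get just the data in its td tag. Please help. The HTML is below.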
<table class="readonlydisplaytable">
<tbody>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Payer Phone #</th>
<td class="readonlydisplayfielddata">1234</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Name</th>
<td class="readonlydisplayfielddata">ABC SERVICES</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Package #</th>
<td class="readonlydisplayfielddata">709</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Case #</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Date</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Adjuster</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Adjuster Phone #</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Adjuster Fax #</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Body Part</th>
<td class="readonlydisplayfielddata">n/a</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Deadline</th>
<td class="readonlydisplayfielddata">11/22/2014</td>
</tr>
</tbody>
</table>
Here is the code I am using.
from selenium import webdriver
import os, time, csv, datetime
from selenium.webdriver.common.keys import Keys
import threading
import multiprocessing
from selenium.webdriver.support.select import Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import openpyxl
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
soup = BeautifulSoup(open("C:\\Users\\mapraveenkumar\\Desktop\\phonepayor.htm"), "html5lib")
a = soup.find_all("table", class_="readonlydisplaytable")
for b in a:
    c = b.find_all("tr", class_="readonlydisplayfield")
    for d in c:
        if "Package #" in d.get_text():
            print(d.get_text())
Answer 0 (score: 1)
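You want the text inside the td element that sits next to the th element containing "Package #". I look for that th text first, then find its parent and the parent's siblings. As usual, when I am trying to work out how to capture what I want, I find it easiest to work in an interactive session. I suspect the main point is to use find_all together with string=.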
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('temp.htm').read(),'lxml')
>>> target = soup.find_all(string='Package #')
>>> target
['Package #']
>>> target[0].findParent()
<th class="readonlydisplayfieldlabel">Package #</th>
>>> target[0].findParent().fetchNextSiblings()
[<td class="readonlydisplayfielddata">709</td>]
>>> tds = target[0].findParent().fetchNextSiblings()
>>> tds[0].text
'709'
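The findParent and fetchNextSiblings names used in the session above are older Beautiful Soup method names kept for backward compatibility. The same lookup can probably be written more directly with find and find_next_sibling. This is only a minimal sketch: it assumes a shortened inline copy of the question's table instead of temp.htm, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened inline copy of the table from the question,
# used here instead of reading temp.htm from disk.
html = """
<table class="readonlydisplaytable">
<tbody>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Name</th>
<td class="readonlydisplayfielddata">ABC SERVICES</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Package #</th>
<td class="readonlydisplayfielddata">709</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the th whose text is exactly "Package #" ...
label = soup.find("th", string="Package #")

# ... then step sideways to the td in the same row.
value = label.find_next_sibling("td").get_text(strip=True)
print(value)
```

If the label might be missing from a page, it would be worth checking that label is not None before calling find_next_sibling.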
Answer 1 (score: 0)
from bs4 import BeautifulSoup as bs

html = '''code above (html)'''  # placeholder for the HTML from the question
soup = bs(html, 'lxml')
find_tr = soup.find_all('tr')  # collects every 'tr' tag
for i in find_tr:
    for j in i.find_all('th'):  # iterates through 'th' tags in the 'tr'
        print(j)
    for k in i.find_all('td'):  # iterates through 'td' tags in the 'tr'
        print(k)
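This should do the job. We create a for loop that goes through every tr tag, and for each tr tag (example below) we run two inner loops that find all of its th and td tags: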
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Payer Phone #</th>
<td class="readonlydisplayfielddata">1234</td>
</tr>
This also works when a row contains more than one td or th tag. When each row has only one of each tag (td, th), we can do the following instead:
find_tr = soup.find_all('tr')  # finds all tr
for i in find_tr:  # goes through all tr
    print(i.th.text)  # .th gives us the th tag of this tr
    print(i.td.text)  # .td.text returns the text of the td
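The loop above prints every label and value. To pull out only the "709" that the question asks for, one option is to collect the rows into a dictionary keyed by the th text. The sketch below is illustrative only: the inline html string is a shortened copy of the question's table, the names fields and package_number are made up for this example, and html.parser is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened inline copy of the table from the question.
html = """
<table class="readonlydisplaytable">
<tbody>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Payer Phone #</th>
<td class="readonlydisplayfielddata">1234</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Name</th>
<td class="readonlydisplayfielddata">ABC SERVICES</td>
</tr>
<tr class="readonlydisplayfield">
<th class="readonlydisplayfieldlabel">Package #</th>
<td class="readonlydisplayfielddata">709</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each row's label (th text) to its value (td text).
fields = {}
for row in soup.find_all("tr", class_="readonlydisplayfield"):
    # Skip rows that lack either cell instead of raising AttributeError.
    if row.th is not None and row.td is not None:
        fields[row.th.get_text(strip=True)] = row.td.get_text(strip=True)

package_number = fields["Package #"]
print(package_number)
```

With the dictionary in hand, any other field from the table can be read the same way, for example fields["Name"].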