Extracting text from a nested tag inside another nested tag using BeautifulSoup in Python 3

Time: 2017-04-30 16:37:08

Tags: html python-3.x beautifulsoup

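I have an HTML page that contains the same block of HTML repeated several times with different data, and I need to get the value "709". I can get all the text inside a tr tag, but I don't know how to go into the tr tag and get only the data in its td tag. Please help. The HTML is below.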

<table class="readonlydisplaytable">
	<tbody>
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Payer Phone #</th>
			<td class="readonlydisplayfielddata">1234</td>
		</tr>
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Name</th>
			<td class="readonlydisplayfielddata">ABC SERVICES</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Package #</th>
			<td class="readonlydisplayfielddata">709</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Case #</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Date</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Adjuster</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Adjuster Phone #</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Adjuster Fax #</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Body Part</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Deadline</th>
			<td class="readonlydisplayfielddata">11/22/2014</td>
		</tr>			
	</tbody>
</table>

Here is the code I am using.

from selenium import webdriver
import os, time, csv, datetime
from selenium.webdriver.common.keys import Keys
import threading
import multiprocessing
from selenium.webdriver.support.select import Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import openpyxl
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd


soup = BeautifulSoup(open("C:\\Users\\mapraveenkumar\\Desktop\\phonepayor.htm"), "html5lib")
a = soup.find_all("table", class_="readonlydisplaytable")
for b in a:
    c = b.find_all("tr", class_="readonlydisplayfield")
    for d in c:
        if "Package #" in d.get_text():
            print(d.get_text())

2 Answers:

Answer 0 (score: 1)

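You want the text inside the td element adjacent to the th element that contains "Package #". I start by looking for that string, then find its parent and the parent's siblings. As usual, when I am trying to work out how to capture what I want, I find it easiest to work in an interactive environment. I suspect the main point is to use find_all together with string=.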

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('temp.htm').read(),'lxml')
>>> target = soup.find_all(string='Package #')
>>> target
['Package #']
>>> target[0].findParent()
<th class="readonlydisplayfieldlabel">Package #</th>
>>> target[0].findParent().fetchNextSiblings()
[<td class="readonlydisplayfielddata">709</td>]
>>> tds = target[0].findParent().fetchNextSiblings()
>>> tds[0].text
'709'
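
The interactive session above can be folded into a small helper. What follows is only a sketch, not part of the original answer: `value_for_label` and `SAMPLE_HTML` are hypothetical names, the sample is a shortened copy of the table from the question, and the built-in "html.parser" is used so the example does not need lxml. It uses `find_next_sibling`, the current snake_case spelling of the sibling lookup shown in the session.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumption: same structure).
SAMPLE_HTML = """
<table class="readonlydisplaytable">
  <tbody>
    <tr class="readonlydisplayfield">
      <th class="readonlydisplayfieldlabel">Name</th>
      <td class="readonlydisplayfielddata">ABC SERVICES</td>
    </tr>
    <tr class="readonlydisplayfield">
      <th class="readonlydisplayfieldlabel">Package #</th>
      <td class="readonlydisplayfielddata">709</td>
    </tr>
  </tbody>
</table>
"""


def value_for_label(soup, label):
    """Return the text of the <td> that follows the <th> whose text is exactly `label`.

    Returns None when no such <th> exists or it has no <td> sibling.
    """
    th = soup.find("th", string=label)  # <th> whose only child is this exact string
    if th is None:
        return None
    td = th.find_next_sibling("td")  # skips the whitespace between the tags
    return td.get_text(strip=True) if td is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(value_for_label(soup, "Package #"))
```

Because `string=` matches the whole text, a label with extra surrounding whitespace in the real page would not match; in that case a regular expression or a function can be passed to `string=` instead.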

Answer 1 (score: 0)

from bs4 import BeautifulSoup as bs

html = '''code above (html'''  # placeholder: paste the HTML from the question here
soup = bs(html, 'lxml')

find_tr = soup.find_all('tr')  # collect every 'tr' tag
for i in find_tr:
    for j in i.find_all('th'):  # iterate over the 'th' tags in this 'tr'
        print(j)
    for k in i.find_all('td'):  # iterate over the 'td' tags in this 'tr'
        print(k)

This should do the job. We create a for loop that goes through every tr tag, and for each tr (example below) we run two inner loops to find all of its th and td tags:

<tr class="readonlydisplayfield">
        <th class="readonlydisplayfieldlabel">Payer Phone #</th>
        <td class="readonlydisplayfielddata">1234</td>
</tr>

This also works when a row has more than one td tag. When each row has just one of each tag (td, th), we can do the following instead:

find_tr = soup.find_all('tr')  # find all tr tags
for i in find_tr:  # go through each tr
    print(i.th.text)  # .th is the first th tag in this tr (None if the row has no th)
    print(i.td.text)  # .td is the first td tag in this tr; .text is its text
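
To get from printing every row to the single value the question asks for, the same loop can collect the rows into a dictionary keyed by the th text. This is a sketch under assumptions, not the answerer's code: `rows_to_dict` and `SAMPLE_HTML` are hypothetical names, the sample is a shortened copy of the question's table, and "html.parser" stands in for lxml.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumption: same structure).
SAMPLE_HTML = """
<table class="readonlydisplaytable">
  <tr class="readonlydisplayfield">
    <th class="readonlydisplayfieldlabel">Payer Phone #</th>
    <td class="readonlydisplayfielddata">1234</td>
  </tr>
  <tr class="readonlydisplayfield">
    <th class="readonlydisplayfieldlabel">Package #</th>
    <td class="readonlydisplayfielddata">709</td>
  </tr>
</table>
"""


def rows_to_dict(html):
    """Map each row's <th> text to its <td> text, skipping incomplete rows."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tr in soup.find_all("tr", class_="readonlydisplayfield"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            continue  # a row without both cells would break .th.text / .td.text
        fields[th.get_text(strip=True)] = td.get_text(strip=True)
    return fields


fields = rows_to_dict(SAMPLE_HTML)
print(fields["Package #"])
```

If the page repeats the table several times, as the question describes, later rows with the same label overwrite earlier ones here; calling the same loop once per table would keep the sets apart.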