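I searched a database and pulled a subset of about 1,850 HTML articles, and I am trying to parse them to find four (4) qualifiers: Equipment ID, Location, Inspector, and Comments. I have two solutions that get me part of the way there, but the piece I am stuck on is looping over the data and returning the unique information (the 4 qualifiers) for each article. Please scroll to the bottom for the subset.
This code gives unique information for one qualifier: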
import urllib
from bs4 import BeautifulSoup
fname = raw_input("Enter file name: ")
if len(fname) < 1: fname = "badhtmlsubset.txt"
hand = open(fname).read()
soup = BeautifulSoup(hand, "html.parser")
i = 1
for stuff in soup.findAll(text="Equipment ID:"):
    print i
    print "Equipment ID:", stuff.findNext('td').text,
    #print "Location", stuff.find(text="Location:").findNext('td') <--Traceback TypeError: find() takes no keyword arguments
    i = i + 1
This code returns the following, but I cannot get the Location, Inspector, or Comments. Equipment ID: V-2 3 Equipment ID: Well 79
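A possible explanation for the TypeError in the commented-out line, as a minimal sketch: each match returned for `text="Equipment ID:"` is a `NavigableString`, which subclasses `str`, so calling `.find(text=...)` on it reaches the built-in string `find` rather than Beautiful Soup's. The snippet below uses Python 3 syntax and a tiny made-up HTML fragment (an assumption, not the real file); climbing to the enclosing table with `find_parent` is just one way to get back to a `Tag` that can be searched.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row fragment standing in for one article's table.
html = """
<table>
<tr><td>Location:</td><td>NMWSS</td></tr>
<tr><td>Equipment ID:</td><td>V-2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

stuff = soup.find(string="Equipment ID:")
# The match is a NavigableString, which is also a plain str.
is_plain_string = isinstance(stuff, str)

try:
    # This resolves to str.find, which rejects keyword arguments.
    stuff.find(text="Location:")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Moving up to the enclosing table yields a Tag, whose find() accepts string=.
table = stuff.find_parent("table")
location = table.find(string="Location:").find_next("td").get_text(strip=True)

print(is_plain_string, error_message, location)
```

Searching from the table rather than forward from the label matters here because, in the sample, the Location row comes before the Equipment ID row.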
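This code prints the correct format, but it just repeats the same information over and over instead of printing the unique values for each article.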
import urllib
from bs4 import BeautifulSoup
fname = raw_input("Enter file name: ")
if len(fname) < 1: fname = "badhtmlsubset.txt"
hand = open(fname).read()
soup = BeautifulSoup(hand, "html.parser")
#This code prints out the correct format, but does not print unique information for each loop. Just repeats the same information over and over.
i = 1
for stuff in soup.findAll(text="Equipment ID:"):
    print "Count=", i
    equipid = soup.find(text="Equipment ID:").findNext('td')
    location = soup.find(text="Location:").findNext('td')
    inspector = soup.find(text="Inspector:").findNext('td')
    body = soup.find(text="Comments:").findNext('td')
    print "Equipment ID:", equipid.text,"Location:", location.text,"Inspector:", inspector.text
    print "Comments:", body.text
    i = i + 1
I would like to get the following output for each article as I iterate over the data:
Equipment ID: Well 53
Location: NMWSS
Inspector: Bob Bobberson
Comments: THE SHELL AND BOTTOM HEAD HAVE PITTING AND GENERAL CORROSION THAT IS BELOW
THE T MIN FOR THE DESIGN PRESSURE OF THIS VESSEL AS AN ALTERNATIVE TO KEEP THE
VESSEL IN SERVICE A NEW T MIN FOR THE SHELL AND HEADS CAN BE ASSUMED. THE
DEEPEST PITS COULD BE REPAIRED AND THE SHELL T MIN SET AT
0.400É?ù AND THE BOTTOM HEAD T MIN SET AT 0.640É?ù
WHICH WOULD GIVE A SLIGHT AMOUNT OF CORROSION ALLOWANCE. AT THESE NEW VALUES
THE VESSEL COULD BE OPERATED AT 92 PSI MAWP. THIS WOULD BE AT THE OWNERS
DISCRETION. IT APPEARS THAT THE PRV ON THIS VESSEL IS SET AT 50 PSI.
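One way the per-article output above might be produced, as a sketch rather than a definitive fix: `soup.find(...)` always starts from the top of the document and returns the first match, which would explain the repeated values. Searching inside each `<table>` instead keeps the four lookups scoped to one article. The sketch uses Python 3 syntax, assumes each article sits in its own table, and runs on a trimmed, made-up stand-in for the real file; `SAMPLE`, `LABELS`, and `extract_records` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for badhtmlsubset.txt (two articles).
SAMPLE = """
<table><tbody>
<tr><td>Location:</td><td>NMWSS</td></tr>
<tr><td>Equipment ID:</td><td>V-2</td></tr>
<tr><td>Inspector:</td><td>Ray Rankin</td></tr>
<tr><td>Comments:</td><td><p>OUT OF SERVICE</p></td></tr>
</tbody></table>
<table><tbody>
<tr><td>Location:</td><td>SMWSS</td></tr>
<tr><td>Equipment ID:</td><td>Well 79</td></tr>
<tr><td>Inspector:</td><td>Ronnie Harleston</td></tr>
<tr><td>Comments:</td><td><p>Repair insulation as needed.</p></td></tr>
</tbody></table>
"""

LABELS = ("Equipment ID:", "Location:", "Inspector:", "Comments:")


def extract_records(html):
    """Return one dict per table, looking each label up inside that table only."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table"):
        record = {}
        for label in LABELS:
            # Search within this table, not the whole document.
            label_cell = table.find("td", string=label)
            value_cell = label_cell.find_next_sibling("td") if label_cell else None
            # Missing labels are recorded as None instead of raising.
            record[label] = value_cell.get_text(" ", strip=True) if value_cell else None
        records.append(record)
    return records


records = extract_records(SAMPLE)
for count, record in enumerate(records, start=1):
    print("Count=", count)
    for label in LABELS:
        print(label, record[label])
```

The `string=label` match is exact, so label cells with extra whitespace or nested markup in the full data set would need a looser comparison.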
The subset here represents 2 articles:
I named this file badhtmlsubset.txt
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
<title></title>
</head>
<body>
"
<p text-align:=""><img src="" alt="" panfee="" softenever="" width="" height="" /></p>
<table width:="" border="1">
<tbody>
<tr>
<td width:="">Field:</td>
<td>Pan Fee</td>
</tr>
<tr>
<td>Location:</td>
<td>NMWSS</td>
</tr>
<tr>
<td>Equipment ID:</td>
<td>V-2</td>
</tr>
<tr>
<td>Date:</td>
<td>07/17/2009</td>
</tr>
<tr>
<td>Inspector:</td>
<td>Ray Rankin</td>
</tr>
<tr>
<td rowspan="">Report(s):</td>
<td>
<p>{rsfiles
path=""data/pan_fee/field/api510/softener_v_2/2009/v2_summary_071709.pdf""}</p>
<p>{rsfiles
path=""data/pan_fee/field/api510/softener_v_2/2009/v2_data_071709.pdf""}</p>
<p>{rsfiles path=""data/pan_fee/field/api510/softener_v_2/2009/v1_calcs.pdf""}
(same calcs as V-1)</p>
<p>{rsfiles
path=""data/pan_fee/field/api510/softener_v_2/2009/v2_acad_071709.pdf""}</p>
</td>
</tr>
<tr>
<td>{rsfiles
path=""data/pan_fee/field/api510/softener_v_2/2009/v2_u1a.pdf""}</td>
</tr>
<tr>
<td>{rsfiles
path=""data/pan_fee/field/api510/softener_v_2/2009/panfee_v2_ticketclosed_012010.pdf""}</td>
</tr>
<tr>
<td>Comments:</td>
<td>
<p>THE SHELL AND BOTTOM HEAD HAVE PITTING AND GENERAL CORROSION THAT IS BELOW
THE T MIN FOR THE DESIGN PRESSURE OF THIS VESSEL AS AN ALTERNATIVE TO KEEP THE
VESSEL IN SERVICE A NEW T MIN FOR THE SHELL AND HEADS CAN BE ASSUMED. THE
DEEPEST PITS COULD BE REPAIRED AND THE SHELL T MIN SET AT
0.400É?ù AND THE BOTTOM HEAD T MIN SET AT 0.640É?ù
WHICH WOULD GIVE A SLIGHT AMOUNT OF CORROSION ALLOWANCE. AT THESE NEW VALUES
THE VESSEL COULD BE OPERATED AT 92 PSI MAWP. THIS WOULD BE AT THE OWNERS
DISCRETION. IT APPEARS THAT THE PRV ON THIS VESSEL IS SET AT 50 PSI.<br />
<br />
FOR FULL CODE COMPLIANCE THE VESSEL SHALL BE DE-RATED IN ACCORDANCE WITH THE
CALIFORNIA OCCUPOATIONAL SAFETY - PRESSURE VESSEL UNIT <a href="" target=
"">CIRCULAR LETTER PV-2006-2</a> AND <a href="" target="">CIRCULAR LETTER
PV-2001-1</a>.</p>
<p> </p>
<p><strong color:="">OUT OF SERVICE, TYE HAMMOND, 10/05/2009</strong></p>
</td>
</tr>
<tr>
<td>UltraPipe Unit ID</td>
<td>PAN FEE</td>
</tr>
<tr>
<td>UltraPipe Circuit ID</td>
<td>7888</td>
</tr>
</tbody>
</table>
<p> </p>
<p> </p>
<p> </p>
<p> </p>" "
<p text-align:=""><img src="" border="border" /></p>
<p text-align:="">?ˇ</p>
<p text-align:=""><strong>Please select the desired piping, at the Ethel D location,
from the submenu.</strong><br />
<span class="">(future location of data - added for presentation)</span></p>
<p text-align:="">?ˇ</p>
<h2 text-align:=""><span class="">There are <strong class="">134</strong> active wells
listed with DOGGR.</span></h2>" "
<div text-align:="">
{vsig}/etheld/api570/flowlines/well_79_fl/2014{/vsig}
</div>
<p text-align:=""> </p>
<table width:="" border="1">
<tbody text-align:="">
<tr text-align:="">
<td width:="" text-align:="">Field:</td>
<td text-align:="">Ethel D</td>
</tr>
<tr text-align:="">
<td text-align:="">Location:</td>
<td text-align:="">SMWSS</td>
</tr>
<tr text-align:="">
<td text-align:="">Equipment ID:</td>
<td text-align:="">Well 79</td>
</tr>
<tr text-align:="">
<td text-align:="">Inspection Dates:</td>
<td text-align:="">Last: 07/30/2014 - Next: 07/30/2019</td>
</tr>
<tr text-align:="">
<td text-align:="">Inspector:</td>
<td text-align:="">Ronnie Harleston</td>
</tr>
<tr text-align:="">
<td text-align:="" rowspan="">Report(s):</td>
<td text-align:="">
<p><strong><span text-decoration:="">2014 INSPECTION
DATA:</span></strong><br />
{rsfiles
path=""data/etheld/api570/flowlines/well79/2014/etheld_well79_flowline_report_073014.pdf""}</p>
<p>{rsfiles
path=""data/etheld/api570/flowlines/well79/2014/etheld_well79_flowline_ultrapipe_073014.pdf""}</p>
<p>{rsfiles
path=""data/etheld/api570/flowlines/well79/2014/etheld_well79_flowline_field_drawing_073014.pdf""}<br />
<br />
<strong><span text-decoration:="">2009 INSPECTION DATA:</span></strong><br />
{rsfiles
path=""data/etheld/api570/flowlines/well79/2009/well79_summary_073009.pdf""}<br title=""
inspection="" /></p>
<p>{rsfiles
path=""data/etheld/api570/flowlines/well79/2009/well79_data_073009.pdf""}</p>
</td>
</tr>
<tr text-align:="">
<td text-align:="">Find in <a href="" target="">Virtual Tour</a></td>
</tr>
<tr text-align:="">
<td text-align:="">Comments:</td>
<td text-align:="">
<p><span text-decoration:=""><strong>Ultrasonic A-Scan Thickness
Inspection:</strong></span><br />
Ultrasonic A-Scan thickness measurements were completed in accordance with
Applus RTD established procedures. 3 thickness measurement locations (TMLs)
were established and thickness measurements were taken at these locations. All
thickness readings have been entered into Ultra Pipe and after review of the
data next inspection will be 07/30/2019. The predicted retirement date of the
circuit is 07/30/2033 based on calculated corrosion rates and a 2 mil per year
default corrosion rate. There are 0 caution TMLs per the current thickness
survey taken on 07/30/2014.</p>
<p><br />
<span text-decoration:=""><strong>API 570 Visual
Inspection:</strong></span></p>
<p>Visual inspection found this piping circuit to be in fair condition. This
piping circuit externally is insulated. The insulation is found to be in poor
condition, with missing or damage insulation, and in the areas the insulation
is missing the surface condition of the piping is covered with light to
moderate surface rust. The piping supports surface was covered with light to
moderate surface rust and no corrosion was present. Support hangers were found
to be in fair condition. The piping was inspected for code compliance issues
and to identify possible leaks, stresses and any condition that might reduce
the life of the piping circuit. All piping will be put on a maximum 5 year
thickness inspection interval per API (class 2 piping) requirements. Some
piping TML locations may require re-inspection prior to the maximum interval.
See UltraPIPE data reports for all required inspection dates.</p>
<p><br />
Recommendations: Repair insulation as needed.</p>
<p>All piping will require an API 570 Visual inspection in 5 years.</p>
</td>
</tr>
<tr text-align:="">
<td text-align:="">UltraPipe Unit ID</td>
<td text-align:="">ETHEL_D</td>
</tr>
<tr text-align:="">
<td text-align:="">UltraPipe Circut ID</td>
<td text-align:="">WELL 79</td>
</tr>
</tbody>
</table>
<p> </p>
<p text-align:=""> </p>
<p> </p>"
</body>
</html>