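This is my code. I want to write the data I have scraped to a file, but I only want the text, not the tags. The file ends up containing all of the HTML tags as well, and I don't know how to get rid of them.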
import urllib2
from bs4 import BeautifulSoup
file = open("megapy.txt", "w")
file.seek(0)
FullPage = ['New-Arrivals-2017-6', 'Big-Sales-click-here', 'Arduino-Development-boards',
'Robotics-and-Copters']
urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
URL = urlp1 + FullPage[0]
for n in FullPage:
    URL = urlp1 + n
    page = urllib2.urlopen(URL)
    bsObj = BeautifulSoup(page, "html.parser")
    descList = bsObj.findAll('div', attrs={"class": "panel-default"})
    for desc in descList:
        print(desc.get_text(separator=u' '))
        file.write(desc.prettify("utf-8"))
file.close()
However, I keep getting this output in the text file:
<div class="panel panel-default">
<div class="panel-heading">
<h5>
2 X 8 FR4 PCB Prototype Circuit board Double Side
</h5>
</div>
<div class="panel-body">
<div class="row">
<div class="col-md-4 pro-image">
<a href="Prd_Detail.aspx?Prd_ID=20246">
<img alt="2 X 8 FR4 PCB Prototype Circuit board Double Side" class="img-thumbnail" src="http://upsats.com/Content/Product/img/Product/Thumb/PCB2x8-.jpg"/>
</a>
</div>
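
A likely cause, judging from the code above: the script prints `desc.get_text(...)` but writes `desc.prettify("utf-8")`, and `prettify()` returns the formatted markup, tags included. Below is a minimal sketch of writing only the text. It assumes the same `panel-default` divs as in the question; `SAMPLE_HTML`, `panel_texts`, and `write_texts` are hypothetical names used for illustration, and the sample markup is a shortened copy of the output shown above so the example runs without network access.

```python
import io

from bs4 import BeautifulSoup

# Shortened sample modeled on the output in the question (illustrative only).
SAMPLE_HTML = u"""
<div class="panel panel-default">
  <div class="panel-heading">
    <h5>
      2 X 8 FR4 PCB Prototype Circuit board Double Side
    </h5>
  </div>
</div>
"""


def panel_texts(html):
    """Return the tag-free text of every div carrying the panel-default class."""
    soup = BeautifulSoup(html, "html.parser")
    panels = soup.find_all("div", attrs={"class": "panel-default"})
    # get_text() keeps only the text nodes; prettify() would return the markup.
    # strip=True trims each text fragment and skips whitespace-only ones.
    return [panel.get_text(separator=u" ", strip=True) for panel in panels]


def write_texts(texts, path):
    """Write one text entry per line, encoded as UTF-8."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as out:
        for text in texts:
            out.write(text + u"\n")


if __name__ == "__main__":
    write_texts(panel_texts(SAMPLE_HTML), "megapy.txt")
```

In the original loop, the equivalent change would be to pass the result of `get_text()` to `file.write()` instead of the result of `prettify()`. Under Python 2, the text may need `.encode("utf-8")` before being written to a file opened with the built-in `open`.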