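I have searched Stack Overflow thoroughly but could not find a suitable solution. I am scraping an old website and want to extract all labels and input names. The HTML of the old pages looks like this: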
<div class="labellong">First Name</div>
<INPUT class="input-l" name="firstname">
<div class="labellong">Last Name</div>
<INPUT class="input-l" name="lastname">
<div class="labellong">Gender</div>
<input type="radio" name="gender" value="male"> Male<br>
<input type="radio" name="gender" value="female"> Female<br>
<table>
<tr valign="top">
<td width="174">User Name</td>
<td width="888"><input name="username" value="" id="username" class="input-m" /></td>
</tr>
<tr>
<td width="174">User Account</td>
<td width="888"><input name="useraccount" value="" id="uaseraccount" class="input-m" /></td>
</tr>
</table>
I want to use Python with BeautifulSoup to extract this output:
First Name, firstname
Last Name, lastname
Gender, gender
User Name, username
User Account, useraccount
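I did try the findAll method, but it failed because I need both the label (text) and the input tag's name. Is there a way to scrape several kinds of HTML tags together with their label text? Thanks.
I am new to web scraping; this is the code I tried: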
from bs4 import BeautifulSoup
import urllib.request as urllib2
f = open("g:\output.txt", "w")
errFile = open("g:\error.txt", "w")
url = "file:///g://pharmacy.htm"
file = urllib2.urlopen(url)
soup = BeautifulSoup(file)
for message1 in soup.findAll(["div", {"class": "labellong"}, "input", {"class": "input-l"}, "td"]):
    outText = message1.get_text()
    f.write( outText + '\n')
f.close()
errFile.close()
Answer 0 (score: -1)
The solution I tried is to select siblings: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#next-sibling-and-previous-sibling
First find the labels, then for each label find its sibling.
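One detail worth illustrating about the sibling idea: the node directly after a label div is usually a whitespace text node, not the input. The sketch below is a minimal illustration on two lines copied from the question; the variable names and the choice of the built-in `html.parser` are assumptions made for the example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Two lines copied from the question's markup, inlined so the sketch runs alone.
html = """
<div class="labellong">First Name</div>
<INPUT class="input-l" name="firstname">
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="labellong")

# .next_sibling is the newline text node between the two tags, not the <input>.
raw_sibling = div.next_sibling

# find_next_sibling("input") skips text nodes and returns the first <input> sibling.
input_tag = div.find_next_sibling("input")
print(div.get_text(strip=True), input_tag["name"])
```

This is why a plain `div.next_sibling["name"]` tends to fail on hand-written HTML, and why either a loop over `next_siblings` or `find_next_sibling` is needed.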
Edit:
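from bs4 import BeautifulSoup

# The html5lib parser must be installed separately (pip install html5lib).
with open("/Projects/python/webscraping/web.html") as fh:
    soup = BeautifulSoup(fh, "html5lib")

# Labels in <div class="labellong">: the input is the first <input> sibling after the div.
# find_all("div", class_="labellong") matches only the label divs.
for div in soup.find_all("div", class_="labellong"):
    labelName = div.get_text()
    inputName = None
    for sibling in div.next_siblings:
        if sibling.name == "input":
            inputName = sibling["name"]
            break
    print(labelName, inputName)

# Labels in table rows: the first <td> holds the label, the second holds the input.
for row in soup.find_all("tr"):
    labelName = None
    inputName = None
    for td in row:
        if td.name == "td":
            if not labelName:  # labelName not yet set
                labelName = td.get_text()
            else:  # second td so inputName
                inputName = td.contents[0]["name"]
    print(labelName, inputName)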
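
As a variation on the two loops above, one could instead walk every input once and look backwards for its label, which also collapses the two gender radio buttons into a single line. This is only a sketch under stated assumptions: the helper name `label_input_pairs` is hypothetical, the markup is an inline copy of the question's sample, and it assumes the label is either the first cell of the input's table row or the closest preceding `labellong` div.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question, so the sketch runs on its own.
HTML = """
<div class="labellong">First Name</div>
<INPUT class="input-l" name="firstname">
<div class="labellong">Last Name</div>
<INPUT class="input-l" name="lastname">
<div class="labellong">Gender</div>
<input type="radio" name="gender" value="male"> Male<br>
<input type="radio" name="gender" value="female"> Female<br>
<table>
<tr valign="top">
<td width="174">User Name</td>
<td width="888"><input name="username" value="" id="username" class="input-m" /></td>
</tr>
<tr>
<td width="174">User Account</td>
<td width="888"><input name="useraccount" value="" id="uaseraccount" class="input-m" /></td>
</tr>
</table>
"""


def label_input_pairs(html):
    """Return (label, input name) pairs in document order, one per input name."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    seen = set()
    for inp in soup.find_all("input"):
        name = inp.get("name")
        if not name or name in seen:
            continue  # skip unnamed inputs and repeated radio buttons
        row = inp.find_parent("tr")
        if row is not None:
            # Table layout: assume the first cell of the row is the label.
            label_tag = row.find("td")
        else:
            # Div layout: assume the closest preceding labellong div is the label.
            label_tag = inp.find_previous("div", class_="labellong")
        if label_tag is None:
            continue
        seen.add(name)
        pairs.append((label_tag.get_text(strip=True), name))
    return pairs


for label, name in label_input_pairs(HTML):
    print(f"{label}, {name}")
```

To write the pairs to a file as in the question, the `print` call can be swapped for a `write` on an open file handle. Pages whose labels follow a different layout would need their own branch in the helper.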