Cannot scrape a web page that uses multiple HTML tags

Date: 2016-11-22 13:30:18

Tags: python html web-scraping beautifulsoup

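I searched Stack Overflow thoroughly but could not find a suitable solution. I am scraping an old website and want to extract every label together with the name of its input. The HTML of the old page looks like this: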

<div class="labellong">First Name</div>
<INPUT class="input-l"  name="firstname">

<div class="labellong">Last Name</div>
<INPUT class="input-l"  name="lastname">

<div class="labellong">Gender</div>
<input type="radio" name="gender" value="male"> Male<br>
<input type="radio" name="gender" value="female"> Female<br>

<table>
    <tr valign="top">
        <td width="174">User Name</td>
        <td width="888"><input name="username" value="" id="username" class="input-m" /></td>
    </tr>
    <tr>
        <td width="174">User Account</td>
        <td width="888"><input name="useraccount" value="" id="uaseraccount" class="input-m" /></td>
    </tr>
</table>

Using Python with BeautifulSoup, I want to produce this output:

First Name, firstname 
Last Name, lastname 
Gender, gender 
User Name, username 
User Account, useraccount

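I did try the findAll method, but it failed because I need both the label text and the name of the input tag. Is there a way to scrape several HTML tags together with their label text? Thanks.

I am new to web scraping. This is the code I tried: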

from bs4 import BeautifulSoup
import urllib.request as urllib2

f = open("g:\output.txt", "w")
errFile = open("g:\error.txt", "w")

url = "file:///g://pharmacy.htm"
file = urllib2.urlopen(url)
soup = BeautifulSoup(file)


for message1 in soup.findAll(["div", {"class": "labellong"}, "input", {"class": "input-l"}, "td"]):
    outText = message1.get_text()
    f.write( outText + '\n')


f.close()
errFile.close()

1 Answer:

Answer 0 (score: -1)

The solution I tried is to select siblings: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#next-sibling-and-previous-sibling

First find the labels, then for each label find its sibling input.
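As a minimal sketch of that sibling lookup, the snippet below pairs each `<div class="labellong">` with the next `<input>` sibling. It runs on a small inline fragment copied from the question, so no file is needed. The names `HTML` and `div_label_pairs` are hypothetical, and the built-in `html.parser` is assumed here only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Inline fragment copied from the question (hypothetical variable name).
HTML = """
<div class="labellong">First Name</div>
<INPUT class="input-l"  name="firstname">

<div class="labellong">Gender</div>
<input type="radio" name="gender" value="male"> Male<br>
<input type="radio" name="gender" value="female"> Female<br>
"""


def div_label_pairs(html):
    """Pair each <div class="labellong"> with its next sibling <input>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_="labellong"):
        # find_next_sibling returns None when no later sibling matches.
        field = div.find_next_sibling("input")
        if field is not None:
            pairs.append((div.get_text(strip=True), field.get("name")))
    return pairs


for label, name in div_label_pairs(HTML):
    print(f"{label}, {name}")
```

Only the first matching sibling is taken, so the two radio buttons that share the name `gender` yield a single pair.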

Edit

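from bs4 import BeautifulSoup

# The html5lib parser is a separate package (pip install html5lib).
with open("/Projects/python/webscraping/web.html") as fh:
    soup = BeautifulSoup(fh, "html5lib")

# Layout 1: a <div class="labellong"> label followed by a sibling <input>.
for div in soup.find_all("div", class_="labellong"):
    labelName = div.get_text(strip=True)
    inputName = None  # reset so a value from the previous label is never reused
    for sibling in div.next_siblings:
        if sibling.name == "input":
            inputName = sibling["name"]
            break
    print(labelName, inputName)

# Layout 2: table rows where the first <td> is the label and the second holds the <input>.
for row in soup.find_all("tr"):
    labelName = None
    inputName = None
    for td in row.find_all("td"):
        if not labelName:  # labelName not yet set
            labelName = td.get_text(strip=True)
        else:  # second td, so it holds the input
            inputName = td.find("input")["name"]
            print(labelName, inputName)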
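
One possible way to handle both layouts in a single pass is sketched below. It is an alternative to the two loops above, not the original answer's method. It selects every label element with a CSS selector and then takes the next `<input>` in document order with `find_next`. The names `HTML` and `label_input_pairs` are hypothetical, and the markup is the sample from the question, embedded inline so the snippet runs without a file.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, embedded so the sketch is self-contained.
HTML = """
<div class="labellong">First Name</div>
<INPUT class="input-l"  name="firstname">

<div class="labellong">Last Name</div>
<INPUT class="input-l"  name="lastname">

<div class="labellong">Gender</div>
<input type="radio" name="gender" value="male"> Male<br>
<input type="radio" name="gender" value="female"> Female<br>

<table>
    <tr valign="top">
        <td width="174">User Name</td>
        <td width="888"><input name="username" value="" id="username" class="input-m" /></td>
    </tr>
    <tr>
        <td width="174">User Account</td>
        <td width="888"><input name="useraccount" value="" id="uaseraccount" class="input-m" /></td>
    </tr>
</table>
"""


def label_input_pairs(html):
    """Return (label text, input name) pairs for both page layouts."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Labels are either <div class="labellong"> or the first <td> of a row.
    for label in soup.select("div.labellong, tr > td:first-child"):
        # find_next walks forward in document order, not only among siblings.
        field = label.find_next("input")
        if field is not None and field.has_attr("name"):
            pairs.append((label.get_text(strip=True), field["name"]))
    return pairs


for label, name in label_input_pairs(HTML):
    print(f"{label}, {name}")
```

Because `find_next` searches the rest of the document, a label with no input of its own would be paired with the next input that appears later. This sketch assumes every label is followed by its own input, as in the sample page.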