Skip specific text inside a tag, Python BeautifulSoup

Time: 2017-09-08 09:38:44

Tags: python web-scraping beautifulsoup

This is my HTML file:

<tr>
<td>1</td>
<td style="font-weight: bold;"><a href="#" onclick="javascript:TollPlazaPopup(272);"> Kherki Daula </a></td> 
<td style="font-weight: bold;">60 <a onclick="return popitup(" https:="" www.google.co.in="" maps="" @28.395604,76.98176,17.52z="" data="!5m1!1e1?hl=en')'" href="https://www.google.co.in/maps/@28.395604,76.98176,17.52z/data=!3m1!1e3!5m1!1e1?hl=en" target="_Blank"> (Live Traffic)</a> &nbsp;&nbsp; - &nbsp;&nbsp; <a href="#" title="Click here to get estimated travel time." id="0-232X" onclick="javascript:TollPlazaTrafficTime(272,this);">ET</a>
</td>
</tr>
<tr>
<td>2</td>
<td style="font-weight: bold;"><a href="#" onclick="javascript:TollPlazaPopup(213);"> Shahjahanpur </a></td>
<td style="font-weight: bold;">125 <a onclick="return popitup(" https:="" www.google.co.in="" maps="" @27.99978,76.430522,17.52z="" data="!5m1!1e1?hl=en')'" href="https://www.google.co.in/maps/@27.99978,76.430522,17.52z/data=!3m1!1e3!5m1!1e1?hl=en" target="_Blank"> (Live Traffic)</a> &nbsp;&nbsp; - &nbsp;&nbsp; <a href="#" title="Click here to get estimated travel time." id="1-179X" onclick="javascript:TollPlazaTrafficTime(213,this);">ET</a>
</td>
</tr>

I am scraping it now, and the result looks like this:

Sr No.  Toll Plaza  Car/Jeep/Van(Rs.)
1   Kherki Daula    60 (Live Traffic)    -    ET
2   Shahjahanpur    125 (Live Traffic)    -    ET
                 Total Charges(Rs.) 90

I want to skip the text "(Live Traffic) - ET" in each row.

My Python code is:

tbody = soup('table' ,{"class":"tollinfotbl"})[0].find_all('tr')[3:]
for row in tbody:
    cols = row.findChildren(recursive=False)
    cols = [ele.text.contents[0] for ele in cols]
    if cols:
        sno = str(cols[0])
        Toll_plaza = str(cols[1])
        cost = str(cols[2])

        query = "INSERT INTO tryroute (sno,Toll_plaza, cost) VALUES (%s, %s, %s);"

When I use .contents[0], I get this error: cols = [ele.text.content[0] for ele in cols] AttributeError: 'str' object has no attribute 'content'

Any help would be appreciated.

2 answers:

Answer 0 (score: 2)

You are getting this error because you are calling "contents" on a str object, namely ele.text:

ele.text # returns a string object (which in your case contains the whole text in that particular tag)

To get the contents of the tag, call it on the tag itself:

ele.contents # inside your list comprehension, this will return a list of all the children of that particular tag
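
As a minimal sketch of that suggestion, assuming a simplified stand-in for one row of the table in the question (the `html` string and the variable names are illustrative, not the asker's actual page): the first child of the third `<td>` is the text node holding the price, so `contents[0]` on that cell skips the link text. The first two cells are read with `get_text`, because the first child of the second cell is an `<a>` tag rather than plain text.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for one row of the table in the question
html = """
<table class="tollinfotbl">
<tr>
<td>1</td>
<td><a href="#"> Kherki Daula </a></td>
<td>60 <a href="#"> (Live Traffic)</a> - <a href="#">ET</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("table", {"class": "tollinfotbl"}).find("tr")
cols = row.find_all("td", recursive=False)

sno = cols[0].get_text(strip=True)
toll_plaza = cols[1].get_text(strip=True)
# contents[0] is the first child of the <td>: the text node before the first <a>
cost = str(cols[2].contents[0]).strip()

print(sno, toll_plaza, cost)
```

This relies on the price always being the first child of the cell; if the markup changes so that a tag comes first, `contents[0]` would return that tag instead.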

Answer 1 (score: 1)

You can use re to extract the value from the raw cell text. You don't need contents[]: hard-coding an index is inflexible and easy to get wrong.

Before copying the code below, add import re at the top.

for row in tbody:
    cols = row.findChildren(recursive=False)
    cols = [ele.text for ele in cols]
    if cols:
        sno = str(cols[0])
        Toll_plaza = str(cols[1])
        cost_raw = str(cols[2])

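        # Raw string so that \d, \s and \( reach the regex engine unchanged
        compiled = re.compile(r'^(\d+)\s*\(', flags=re.IGNORECASE | re.DOTALL)
        match = re.search(compiled, cost_raw)
        # Use None when the cell does not start with a number, so cost is always defined
        cost = match.group(1) if match else None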

        query = "INSERT INTO tryroute (sno,Toll_plaza, cost) VALUES (%s, %s, %s);"
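
The regex step can be tried on its own, without the page or a database. The following is a small sketch under that assumption: the helper name `extract_cost` and the constant `COST_PATTERN` are hypothetical, and the sample strings are the cell text shown in the scraped output from the question.

```python
import re

# Leading digits followed by optional whitespace and an opening parenthesis
COST_PATTERN = re.compile(r'^\s*(\d+)\s*\(')

def extract_cost(cell_text):
    """Return the leading number as a string, or None if the cell has no such prefix."""
    match = COST_PATTERN.search(cell_text)
    return match.group(1) if match else None

# Cell text as it appears in the scraped output from the question
samples = ["60 (Live Traffic)    -    ET", "125 (Live Traffic)    -    ET"]
costs = [extract_cost(text) for text in samples]
print(costs)
```

Compiling the pattern once outside the loop avoids rebuilding it for every row, and returning None for non-matching cells lets the caller decide whether to skip the row or store a null.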

Let me know if you need any clarification.