Splitting HTML text on <br> when using BeautifulSoup

Time: 2016-03-22 13:31:02

Tags: python regex beautifulsoup

HTML code:

<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>

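I need the values 4.5 kn and 7.1 kn as separate list items so that I can append them individually. I tried to split the text string with re.sub, but it does not work. I also tried using replace to replace the br, but that had no effect either. Can anyone offer some insight?

Python code: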

def NameSearch(shipLink, mmsi, shipName):
    from bs4 import BeautifulSoup
    import urllib2
    import csv
    import re

    values = []
    values.append(mmsi)
    values.append(shipName)
    regex = re.compile(r'[\n\r\t]')
    i = 0
    with open('Ship_indexname.csv', 'wb')as f:
        writer = csv.writer(f)
        while True:
            try:
                shipPage = urllib2.urlopen(shipLink, timeout=5)
            except urllib2.URLError:
                continue
            except:
                continue
            break
        soup = BeautifulSoup(shipPage, "html.parser")  # Read the web page HTML
        #soup.find('br').replaceWith(' ')
        #for br in soup('br'):
            #br.extract()
        table = soup.find_all("table", {"id": "vessel-related"})  # Finds tables with id "vessel-related"
        for mytable in table:                                   #Loops over the matching tables
            table_body = mytable.find_all('tbody')                  #Finds tbody section in table
            for body in table_body:
                rows = body.find_all('tr')                #Finds all rows
                for tr in rows:                                 #Loops rows
                    cols = tr.find_all('td')                    #Finds the columns
                    for td in cols:                             #Loops the columns
                        checker = td.text.encode('ascii', 'ignore')
                        check = regex.sub('', checker)
                        if check == ' Speed (avg./max): ':
                            i = 1
                        elif i == 1:
                            print td.text
                            pat=re.compile('<br\s*/>')
                            print pat.sub(" ",td.text)
                            values.append(td.text.strip("\n").encode('utf-8'))  #Takes the second columns value and assigns it to a list called Values
                            i = 0
    #print values
    return values


NameSearch('https://www.fleetmon.com/vessels/kind-of-magic_0_3478642/','230034570','KIND OF MAGIC')

1 answer:

Answer 0 (score: 0)

Find the "Speed (avg./max):" label first, then move to the value via .find_next():

from bs4 import BeautifulSoup   

data = '<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>'
soup = BeautifulSoup(data, "html.parser")

label = soup.find("label", class_="identifier", text="Speed (avg./max):")
value = label.find_next("td", class_="value").get_text(strip=True)
print(value)  # prints 4.5 kn7.1 kn
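
The two speeds run together because .text and get_text() return text nodes only: the <br> element contributes no characters, which is also why a re.sub on td.text finds nothing to replace. If the goal is separate list items, one option is to iterate over the cell's text nodes with .stripped_strings instead of joining them. A minimal sketch, assuming the same markup as above (the names cell and speeds are illustrative only):

```python
from bs4 import BeautifulSoup

data = ('<td> <label class="identifier">Speed (avg./max):</label> </td> '
        '<td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>')
soup = BeautifulSoup(data, "html.parser")

# Locate the label by its text, then step forward to the value cell.
label = soup.find("label", class_="identifier", string="Speed (avg./max):")
cell = label.find_next("td", class_="value")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are empty after stripping, so the text on either side
# of <br> arrives as its own item.
speeds = list(cell.stripped_strings)
print(speeds)  # prints ['4.5 kn', '7.1 kn']
```

Each item can then be appended to a list on its own, without any regex on the joined string.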

Now you can extract the actual numbers from the string:

import re

speed_values = re.findall(r"([0-9.]+) kn", value)
print(speed_values)

Prints ['4.5', '7.1']

Then you can go further, converting the values to floats and unpacking them into separate variables:

avg_speed, max_speed = map(float, speed_values)
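
The unpacking above raises a ValueError whenever the page does not yield exactly two numbers, and the pattern [0-9.]+ would also accept a lone dot. A slightly more defensive variant might look like the sketch below; parse_speeds is a hypothetical helper, and the stricter pattern is an assumption about how the speeds are formatted on the page:

```python
import re


def parse_speeds(value):
    """Return (avg, max) in knots from text such as '4.5 kn7.1 kn'.

    Hypothetical helper for illustration, not part of the original answer.
    """
    # \d+(?:\.\d+)? accepts '4' or '4.5' but rejects a bare '.'.
    found = re.findall(r"(\d+(?:\.\d+)?)\s*kn", value)
    if len(found) != 2:
        # Fail with a clear message instead of an opaque unpacking error.
        raise ValueError("expected two speeds, got %r" % (found,))
    avg_speed, max_speed = (float(v) for v in found)
    return avg_speed, max_speed


print(parse_speeds("4.5 kn7.1 kn"))  # prints (4.5, 7.1)
```

Raising early keeps a malformed row from silently writing misaligned values into the output list.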