Question

Below is an excerpt from the HTML of a page I want to scrape. Given:

<tbody>
  <tr>
     <th>SAT Math</th>
     <td>"541 average"</td>
  </tr>
</tbody>

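I'm using Python and Beautiful Soup to scrape the page and extract the 541, but my question is:

Once I've extracted "541 average", how do I get rid of all the extra material? For example, for GPA, how do I get rid of the "average"?

Thanks so much, I really appreciate anyone who can help!

(Sorry, I'm a beginner at Python and web scraping.)

Current code: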

import urllib2
from bs4 import BeautifulSoup

import csv
from datetime import datetime

quote_page = 'https://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId='+str(i)
page = urllib2.urlopen(quote_page)

soup = BeautifulSoup(page, 'html.parser')
table = soup.find("div", attrs={"id":"section8"})

name_box = soup.find('div', attrs={'class': 'cp_left'}).find('h1')
name = name_box.text.strip() # strip() removes leading and trailing whitespace
print name

datasets = []
for row in table.find_all("tr")[1:]:

    if ((zip(th.get_text() for th in row.find_all("th")))!=[(u'SAT Math',)]):
        continue

    dataset = zip((th.get_text() for th in row.find_all("th")), (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

    for dataset in datasets:
        for field in dataset:
            print format(field[1])

Answer 1

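If the result always contains the "average" text, you can try extracting just the number with a regular expression.

You basically need to manipulate the string.

Something like this: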

import re

s = "541 average"
extractTheNumber = re.findall(r'(\d+?)\s', s)

print(extractTheNumber[0])

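This matches one or more consecutive digit characters up to the first whitespace character (the whitespace is outside the capture group, so it is not part of the returned result).

Try this tool, it can be very useful: https://regex101.com/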
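Since the question also mentions GPA, which is usually a decimal, here is a minimal sketch of a slightly more general version of the same idea. The helper name `extract_number` and the sample strings other than "541 average" are assumptions made up for illustration; they are not from the page being scraped.

```python
import re

# Hypothetical helper (not part of the original answer): return the first
# number found in a string, or None if the string contains no digits.
def extract_number(text):
    # \d+ matches the integer part; (?:\.\d+)? optionally matches a decimal part.
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return None
    value = match.group(0)
    # Return a float for decimals (such as a GPA), an int otherwise.
    return float(value) if '.' in value else int(value)

print(extract_number('"541 average"'))  # 541
print(extract_number('3.75 average'))   # 3.75 (illustrative GPA string)
print(extract_number('Not reported'))   # None (illustrative missing value)
```

In the question's loop, this could be applied to the text of each td, for example extract_number(td.get_text()). Unlike the pattern above, it does not require whitespace after the number, so it should also work when the number is the last thing in the string.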
