Question

I have the following HTML:

<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5（+1.33%）</span></td>
</tr>
</table>

I am trying to extract td class="stoksPrice">191.1</td> from the row that contains 191.1.

from bs4 import BeautifulSoup
# html holds the markup shown above
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print(res)

But the result is []. How can I find these cells?

Answer 1

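There appear to be two problems:

First, your use of find_all is incorrect. As written, it searches for a tag named stoksPrice, which does not exist; your tags are table, tr, td, div, and span. You need to change it to:

>>> res = soup.find_all(class_='stoksPrice')

to search for tags with that class.

Second, your HTML is malformed. The markup around the stoksPrice cell is:

</td>
td class="stoksPrice">191.1</td>

It should be:

</td>
<td class="stoksPrice">191.1</td>

(Note the < before td class="stoksPrice".) I am not sure whether this is a copy error on Stack Overflow or the HTML was malformed to begin with, but as it stands it will not be easy to parse.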
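To make the difference concrete, here is a minimal sketch. It assumes the missing < has been restored in the HTML and uses the built-in html.parser backend, which is a choice made for this illustration and not something the question specifies. It contrasts searching by tag name with searching by class.

```python
from bs4 import BeautifulSoup

# Assumption: the HTML from the question with the missing "<" restored.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5（+1.33%）</span></td>
</tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

# A bare string is treated as a tag name; there is no <stoksPrice> tag.
by_name = soup.find_all('stoksPrice')

# class_ matches any tag whose class attribute includes this value.
by_class = soup.find_all(class_='stoksPrice')

print(by_name)
print(len(by_class))
print(by_class[-1].get_text())
```

Searching by class returns both td cells that carry stoksPrice, so the price still has to be picked out of the result, here by taking the last match.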

Answer 2

Since several tags share the same class, you can use a CSS selector to get an exact match.

from bs4 import BeautifulSoup
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5（+1.33%）</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1

Alternatively, you can use find with a lambda to get the same result.

print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1

Note: BeautifulSoup converts multi-valued class attributes into a list. So the classes of the two td tags look like ['stoksPrice', 'realTimChange'] and ['stoksPrice'].
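The list behaviour of the class attribute can be checked directly. The sketch below assumes a trimmed-down, well-formed version of the row and the html.parser backend (both are choices made for illustration). It prints the class lists and compares a class selector with an attribute selector.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed-down version of the row, with well-formed cells.
html = '''<table><tr>
<td class="stoksPrice realTimChange"><div class="realTimChangeMod"></div></td>
<td class="stoksPrice">191.1</td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

# Each class attribute is exposed as a list of its space-separated values.
class_lists = [td['class'] for td in soup.find_all('td')]
print(class_lists)

# The .stoksPrice class selector matches both cells ...
loose = soup.select('td.stoksPrice')
# ... while the attribute selector compares the whole attribute value.
exact = soup.select('td[class="stoksPrice"]')

print(len(loose), len(exact))
print(exact[0].get_text())
```

The attribute selector is the stricter of the two, which is why it singles out the cell whose class is exactly stoksPrice.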

Answer 3

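Here is one way to do this using findAll.

Because all the earlier stoksPrice cells are empty, the only one left is the one holding the price.

You can use a try/except clause to check whether the text is a float.

If it is not, the loop moves on to the next cell; if it is, the value is printed.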

res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        # float() raises ValueError for empty or non-numeric text
        t = float(r.text)
        print(t)
    except ValueError:
        pass

191.1
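The same idea can be wrapped in a small helper that returns the value instead of printing it. This is only a sketch: first_price and css_class are hypothetical names introduced here, and the trimmed-down HTML and the html.parser backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def first_price(soup, css_class='stoksPrice'):
    """Return the first cell value that parses as a float, or None.

    first_price and css_class are hypothetical names used for this sketch.
    """
    for cell in soup.find_all('td', class_=css_class):
        try:
            return float(cell.get_text(strip=True))
        except ValueError:
            # Empty or non-numeric cell: keep looking.
            continue
    return None

html = '''<table><tr>
<td class="stoksPrice realTimChange"><div class="realTimChangeMod"></div></td>
<td class="stoksPrice">191.1</td>
</tr></table>'''

price = first_price(BeautifulSoup(html, 'html.parser'))
print(price)
```

Catching only ValueError keeps unrelated errors visible. Note that text with thousands separators, such as 1,234.5, would also be skipped by float() and would need cleaning first.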
