How to get the first child table row from a table in BeautifulSoup (Python)

Time: 2015-07-22 05:40:35

Tags: python beautifulsoup

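Here is the code and a sample result. I only want the first column of the table and want to ignore the rest. Please help. There are similar questions on Stack Overflow, but they did not help.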

<tr>
<td>JOHNSON</td>
<td> 2,014,470 </td>
<td>0.81</td>
<td>2</td>
</tr>


I want JOHNSON only, as it is the first child.
My Python code is:

import requests
from bs4 import BeautifulSoup

def find_raw():
    url = 'http://names.mongabay.com/most_common_surnames.htm'
    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html)
    # Prints the text of every cell in every row, not just the first cell
    for n in soup.find_all('tr'):
        print(n.text)

find_raw()
What I get:
SMITH 2,501,922 1.0061
JOHNSON 2,014,470 0.812

2 answers:

Answer 0 (score: 3)

You can find all the tr tags with find_all, then call find on each tr to get its td (find returns only the first match). If a td exists, print it:

for tr in soup.find_all('tr'):
    td = tr.find('td')
    if td:
        print(td)
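To get just the name (JOHNSON) rather than the whole td tag, the find_all / find approach can be combined with get_text. The following is only a sketch: the helper name first_cells and the inline SAMPLE_HTML (including its header row) are assumptions made for illustration, modeled on the row shown in the question instead of fetching the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question's row; the header row is an assumption.
SAMPLE_HTML = """
<table>
<tr><th>Surname</th><th>Count</th></tr>
<tr><td>SMITH</td><td> 2,501,922 </td><td>1.006</td><td>1</td></tr>
<tr><td>JOHNSON</td><td> 2,014,470 </td><td>0.81</td><td>2</td></tr>
</table>
"""

def first_cells(html):
    """Return the stripped text of the first td in every row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tr in soup.find_all("tr"):
        td = tr.find("td")  # find() returns the first match, or None
        if td is not None:
            names.append(td.get_text(strip=True))
    return names

print(first_cells(SAMPLE_HTML))  # ['SMITH', 'JOHNSON']
```

Rows that contain no td at all (such as the assumed header row) are skipped because find returns None for them.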

Answer 1 (score: 2)

Iterate over the tr elements, then print the text of the first td:

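import bs4

# data is assumed to hold the page's HTML
for tr in bs4.BeautifulSoup(data, 'html.parser').select('tr'):
    try:
        print(tr.select('td')[0].text)
    except IndexError:
        # Rows without any td (for example, header rows) are skipped
        pass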

Or shorter:

>>> [tr.td for tr in bs4.BeautifulSoup(data).select('tr') if tr.td]
[<td>SMITH</td>, <td>JOHNSON</td>, <td>WILLIAMS</td>, <td>JONES</td>, ...]
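A single CSS selector could also pick the first td of each row directly, which avoids the per-row check. This is a sketch under assumptions: the inline html string is made up for illustration, and the td:nth-of-type(1) selector is one possible way to express "first td in its row", not the method used in the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = """<table>
<tr><td>SMITH</td><td>2,501,922</td></tr>
<tr><td>JOHNSON</td><td>2,014,470</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Match the first td among the td children of every tr.
first_tds = soup.select("tr > td:nth-of-type(1)")

# Keep only the text, with surrounding whitespace removed.
surnames = [td.get_text(strip=True) for td in first_tds]
print(surnames)  # ['SMITH', 'JOHNSON']
```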
