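Getting the nth element using BeautifulSoup

Time: 2012-01-04 09:09:19

Tags: python web-scraping beautifulsoup

From a large table, I want to read rows 5, 10, 15, 20, and so on using BeautifulSoup. How do I do this? Is findNextSibling with an incrementing counter the way to go?

5 answers: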

Answer 0 (score: 43)

You can also use findAll to get all the rows in a list, then use slice syntax to pick out the elements you need:

rows = soup.findAll('tr')[4::5]
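
To see why [4::5] lands on rows 5, 10, 15 and 20: the slice starts at index 4 (the fifth item, since indexing is zero-based) and then steps forward by 5. A minimal sketch, using a plain list of made-up labels as a stand-in for the list that findAll returns:

```python
# Stand-in for soup.findAll('tr'): 20 placeholder "rows" labelled 1..20.
rows = ["row %d" % n for n in range(1, 21)]

# Start at index 4 (the 5th row), then take every 5th item after it.
every_fifth = rows[4::5]
print(every_fifth)  # ['row 5', 'row 10', 'row 15', 'row 20']
```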

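Answer 1 (score: 3)

If you know the row numbers you want, this is easy to do with select in Beautiful Soup. (Note: this is in bs4.)

row = 5
while True:
    # The selector must be built from a string, so convert the row number
    element = soup.select('tr:nth-of-type(' + str(row) + ')')
    if len(element) > 0:
        # element holds your desired row; do what you want with it
        row += 5
    else:
        break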
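
If the rows are all siblings under the same parent, the loop can probably be collapsed into a single selector, since the CSS pattern nth-of-type(5n) matches every fifth element of its type. A minimal sketch, assuming bs4 4.7 or later (where select is backed by the soupsieve package); the ten-row table is made-up sample data:

```python
from bs4 import BeautifulSoup

# Made-up sample table: ten rows labelled foo1..foo10.
html = "<table>" + "".join(
    "<tr><td>foo%d</td></tr>" % n for n in range(1, 11)
) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# 5n matches the 5th, 10th, 15th, ... <tr> among its sibling <tr> elements.
rows = soup.select("tr:nth-of-type(5n)")
texts = [row.get_text(strip=True) for row in rows]
print(texts)  # ['foo5', 'foo10']
```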


Answer 2 (score: 1)

As a general solution, you can convert the table into a nested list and iterate over it...

from bs4 import BeautifulSoup

def listify(table):
    """Convert an HTML table to a nested list of cell strings."""
    result = []
    for row in table.findAll('tr'):
        # One inner list per table row
        cells = []
        for col in row.findAll('td'):
            # Join every text fragment found inside the cell
            cells.append(''.join(col.findAll(text=True)))
        result.append(cells)
    return result

if __name__ == "__main__":
    # Build a small table with one column and ten rows, then parse it into a list
    htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr>  <tr> <td>foo6</td> </tr>  <tr> <td>foo7</td> </tr>  <tr> <td>foo8</td> </tr>  <tr> <td>foo9</td> </tr>  <tr> <td>foo10</td> </tr></table>"""
    soup = BeautifulSoup(htstring, 'html.parser')
    for idx, ii in enumerate(listify(soup)):
        # idx is zero-based, so skip every row whose number is not a multiple of 5
        if (idx + 1) % 5 > 0:
            continue
        print(ii)

Running that...

[mpenning@Bucksnort ~]$ python testme.py
['foo5']
['foo10']
[mpenning@Bucksnort ~]$

Answer 3 (score: 1)

Another option, if you prefer the raw HTML...

from bs4 import BeautifulSoup

# Build a small table with one column and ten rows, then keep every fifth <tr>
htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr>  <tr> <td>foo6</td> </tr>  <tr> <td>foo7</td> </tr>  <tr> <td>foo8</td> </tr>  <tr> <td>foo9</td> </tr>  <tr> <td>foo10</td> </tr></table>"""
soup = BeautifulSoup(htstring, 'html.parser')
# enumerate() is zero-based, so (idx + 1) is the row number
result = [html_tr for idx, html_tr in enumerate(soup.findAll('tr'))
          if (idx + 1) % 5 == 0]
print(result)

Running that...

[mpenning@Bucksnort ~]$ python testme.py
[<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
[mpenning@Bucksnort ~]$
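
The modulo test on (idx+1) and the [4::5] slice from Answer 0 select the same rows. A small sketch to check this, using made-up string labels in place of the <tr> elements:

```python
# Placeholder row labels standing in for the <tr> elements (hypothetical data).
rows = ["foo%d" % n for n in range(1, 11)]

# Filter by 1-based row number, as in the list comprehension above.
by_modulo = [row for idx, row in enumerate(rows) if (idx + 1) % 5 == 0]

# Same selection with a slice: start at index 4, step 5.
by_slice = rows[4::5]

print(by_modulo)              # ['foo5', 'foo10']
print(by_modulo == by_slice)  # True
```

The slice is shorter, while the enumerate form is easier to adapt if the condition later becomes something other than a fixed step.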

Answer 4 (score: 0)

gazpacho can also be used to scrape the 5th distribution link from this Wikipedia page.
