Question

I am trying to scrape the page at "http://codeforces.com/contest/554/standings".

I use the following two lines to read all of the contestants' names:

Extended slicing takes the form L[start:stop:step]:

import numpy as np

myArray = np.arange(0, 100)

increment = 10  # the value of increment can be chosen with an if condition
for i in myArray[0::increment]:
    print(i)

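However, table2 does not print all of the table rows. I found "<!-- suppress HtmlUnknownAttribute -->" written before every row that I could not scrape.

Is there any particular reason for this?

I am just a beginner at web scraping.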

Answer 1

You may need to share your code in full. Starting from your initial find_all on 'tr', I get the expected 100 contestant names:

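import urllib.request
from bs4 import BeautifulSoup

# Download the standings page and parse it with the built-in HTML parser.
response = urllib.request.urlopen('http://codeforces.com/contest/554/standings')
html = response.read()
soup = BeautifulSoup(html, 'html.parser')

# The standings table holds one <tr> per contestant, plus header and summary rows.
table = soup.find('table', {'class': 'standings'})
rows = table.find_all('tr')

for row in rows:
    # Only rows with a contestant cell carry a name; skip all others.
    contestant = row.find_all('td', {'class': 'contestant-cell'})
    if len(contestant) > 0:
        # Quick'n dirty dig. Makes un-safe assumptions about the HTML structure.
        print(contestant[0].a.string)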

You will notice that some extra digging is needed after fetching the table rows, because not every row contains contestant information.
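To make the row filtering concrete, here is a minimal sketch that runs offline on a made-up HTML snippet. It uses the standard library's html.parser instead of BeautifulSoup so that it needs no extra installs. The sample markup, the names "alice" and "bob", and the ContestantParser class are all assumptions for illustration; only the "standings" and "contestant-cell" class names come from the code above, and the real page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical sample: a header row, two contestant rows (each preceded by
# the comment mentioned in the question), and a summary row.
SAMPLE_HTML = """
<table class="standings">
  <tr><th>#</th><th>Who</th></tr>
  <!-- suppress HtmlUnknownAttribute -->
  <tr><td>1</td><td class="contestant-cell"><a href="/profile/alice">alice</a></td></tr>
  <!-- suppress HtmlUnknownAttribute -->
  <tr><td>2</td><td class="contestant-cell"><a href="/profile/bob">bob</a></td></tr>
  <tr><td colspan="2">Accepted: 2</td></tr>
</table>
"""


class ContestantParser(HTMLParser):
    """Collect link text found inside <td class="contestant-cell"> cells."""

    def __init__(self):
        super().__init__()
        self.names = []        # extracted contestant names
        self.comments = []     # HTML comments seen along the way
        self.row_count = 0     # every <tr>, with or without a contestant
        self._in_cell = False  # currently inside a contestant cell?
        self._in_link = False  # currently inside an <a> within that cell?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_count += 1
        elif tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            self._in_cell = "contestant-cell" in classes
        elif tag == "a" and self._in_cell:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Keep only text that sits inside a link in a contestant cell.
        if self._in_link and data.strip():
            self.names.append(data.strip())

    def handle_comment(self, data):
        # Comments are reported separately from the rows that follow them.
        self.comments.append(data.strip())


parser = ContestantParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.row_count, parser.names)
```

In this sample there are more rows than names, which is the same reason the loop above checks for a contestant cell before reading the link text. The comments reach handle_comment on their own and do not hide the rows after them, at least for this parser and this simplified markup.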

Web scraping <!-- suppress HtmlUnknownAttribute -->
