Question

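I'm trying to extract some useful information from a website. I'm a bit stuck and need your help!

I need the information from this table:

http://gbgfotboll.se/serier/?scr=scorers&ftid=57700

I wrote this code, and it gets me the information I want: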

import lxml.html
from lxml.etree import XPath

url = ("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")

rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr")
name_xpath = XPath("td[1]//text()")
team_xpath = XPath("td[2]//text()")

league_xpath = XPath("//*[@id='content-primary']/h1//text()")


html = lxml.html.parse(url)

divName = league_xpath(html)[0]

for id,row in enumerate(rows_xpath(html)):
    scorername = name_xpath(row)[0]
    team = team_xpath(row)[0]
    print scorername, team


print divName

But I get this error:

    scorername = name_xpath(row)[0]
IndexError: list index out of range

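I understand why I get the error. What I really need help with is that I only need the first 12 rows. This is what the extraction should do in each of the three possible cases:

If there are fewer than 12 rows: take all the rows except the last row.

If there are exactly 12 rows: same as above.

If there are more than 12 rows: simply take the first 12 rows.

How can I do this?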

EDIT1

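This is not a duplicate. Sure, it is the same site, but I have already done what that person wanted, which was to get all the values from a row. I can already do that. I don't need the last row, and I don't want it to extract more than 12 rows if there are more.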

Answer 1

I think this is what you want:

#coding: utf-8
from lxml import etree
import lxml.html

collected = []  # list of rows, each a list of column texts: [[col1, col2, ...], ...]
dom = lxml.html.parse("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")
#all table rows
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval('//div[@id="content-primary"]/div/table[1]/tbody/tr')
# If there are 12 rows or fewer: take all the rows except the last.
if len(rows) <= 12:
    rows = rows[:-1]  # slicing is also safe when no rows were matched
else:
    # If there are more than 12 rows: simply take the first 12 rows.
    rows = rows[:12]

for row in rows:
    # all columns of current table row (Spelare, Lag, Mal, straffmal)
    columns = row.findall("td")
    # pick textual data from each <td>
    collected.append([column.text for column in columns])

for record in collected:
    print(record)
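
One caveat with the code above: `column.text` only returns the text that sits directly inside the `<td>`, before its first child element. If a cell wraps its text in a nested tag such as a link (which the question's `td[1]//text()` hints may be the case), `column.text` is `None`. The sketch below shows one way around that by joining `itertext()`. It runs offline against made-up sample markup; `SAMPLE` and `cell_texts` are hypothetical names, and the real page's structure is an assumption.

```python
try:
    from lxml import etree  # the library used in the answer
except ImportError:
    # Fallback so the sketch runs without lxml; the calls used below
    # (fromstring, findall, itertext) exist in both libraries.
    import xml.etree.ElementTree as etree

# Hypothetical sample markup standing in for the real table.
SAMPLE = """
<table>
  <tbody>
    <tr><td><a href="#">Player A</a></td><td>Team X</td></tr>
    <tr><td><a href="#">Player B</a></td><td>Team Y</td></tr>
    <tr><td colspan="2">Footer row</td></tr>
  </tbody>
</table>
"""


def cell_texts(row):
    """Return the full text of every <td> in a row, including nested tags."""
    return ["".join(cell.itertext()).strip() for cell in row.findall("td")]


table = etree.fromstring(SAMPLE)
rows = table.findall(".//tr")

# Same selection rule as above: drop the last row when there are
# 12 rows or fewer, otherwise keep only the first 12.
rows = rows[:-1] if len(rows) <= 12 else rows[:12]

collected = [cell_texts(row) for row in rows]
for record in collected:
    print(record)
```

With the sample markup, the footer row is dropped and the player names are picked up even though they sit inside `<a>` tags.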


Answer 2

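This is how you can get the rows you need, based on what you described in your post. It is only the logic, built around the idea of a `rows` list; you will have to incorporate it into your own code as needed.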

if len(rows) <= 12:
    print(rows[:-1])
else:
    print(rows[:12])
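
The same rule can be wrapped in a small helper so it is easy to try on plain lists before wiring it into the scraper. This is only a sketch: `first_rows` and its `limit` parameter are hypothetical names, not part of lxml or of the code above.

```python
def first_rows(rows, limit=12):
    """Return at most `limit` rows.

    When there are `limit` rows or fewer, the last row is dropped;
    otherwise only the first `limit` rows are kept.
    """
    if len(rows) <= limit:
        return rows[:-1]
    return rows[:limit]


# Plain lists stand in for the list of <tr> elements.
print(first_rows(list(range(5))))        # fewer than 12 rows
print(len(first_rows(list(range(12)))))  # exactly 12 rows
print(len(first_rows(list(range(30)))))  # more than 12 rows
print(first_rows([]))                    # no rows matched at all
```

Because it relies on slicing and never calls `pop()`, the helper also returns an empty list instead of raising when the XPath matches nothing.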

Extracting information from a website using XPath and Python
