Accessing table values with a relative XPath in Python

Time: 2017-12-26 23:07:25

Tags: python xpath

I'm trying to find a relative XPath (not an absolute one) that lets me extract data from this URL: https://www.sec.gov/Archives/edgar/data/1000228/000100022810000006/the10k_2009.htm

In my code, SalesB returns a value ('233,715'), but SalesA comes back empty. What am I doing wrong?

SalesB returns that value, which can be found through the SEC_pageA variable (see https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm).

I would like SalesA to return the "Net sales" figure (i.e. 6,538,336), which can be found here: https://www.sec.gov/Archives/edgar/data/1000228/000100022810000006/the10k_2009.htm

1 answer:

Answer 0 (score: 0)

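Some of the text you are matching is not on a single line in the page's HTML source, so an XPath that expects it on one line cannot find what you actually want. Writing the same line breaks into the expression makes it match: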

from lxml import html
import requests

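# The line breaks and indentation inside the quoted strings are deliberate:
# they mirror how the text is wrapped in the page's HTML source.
xpath_a = """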
//*[text()[contains(., "CONSOLIDATED
      STATEMENTS OF INCOME")]]/following::td[contains(., "Net
      sales")][1]/following-sibling::td[@valign="bottom"][3]//text()
      """

SEC_pageA = requests.get('https://www.sec.gov/Archives/edgar/data/1000228/000100022810000006/the10k_2009.htm')
SEC_treeA = html.fromstring(SEC_pageA.content)
SalesA = SEC_treeA.xpath(xpath_a)

print(SalesA)

This prints:

['6,538,336']
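
If copying the exact line breaks into the expression feels fragile, one possible alternative is the standard XPath 1.0 function normalize-space(), which collapses runs of whitespace (including newlines) into a single space before the comparison. The sketch below is only an illustration: SAMPLE is a made-up fragment that imitates wrapped cell text, not the real filing, and the sibling index [2] applies to this fragment only.

```python
from lxml import html

# Hypothetical fragment (not the real SEC page): the label "Net sales"
# is wrapped across two lines, as in the filing's HTML source.
SAMPLE = """
<table>
  <tr>
    <td>Net
      sales</td>
    <td valign="bottom">$</td>
    <td valign="bottom">6,538,336</td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# A single-line literal finds nothing, because the cell text contains
# a newline and indentation between "Net" and "sales".
plain = tree.xpath('//td[contains(., "Net sales")]')

# normalize-space() collapses that whitespace to one space first.
robust = tree.xpath('//td[contains(normalize-space(.), "Net sales")]')

# Take the second bottom-aligned cell to the right of the label.
value = robust[0].xpath('following-sibling::td[@valign="bottom"][2]//text()')

print(len(plain), len(robust), value)
```

With this approach the expression no longer depends on how the source happens to be indented, although the sibling position still has to be checked against the actual table.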