Python 3.4: XPath: looping through tr tags and nested td tags

Date: 2015-06-15 08:54:30

Tags: python-3.x xpath lxml

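The tr[2] step in the XPath assigned to contentB below retrieves only one tr tag, whereas I want to loop through all the tr tags in the table and append the content of each td to the list e.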

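A snippet of the HTML I am working with accompanied the original post but is not reproduced here. The code in question: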

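import re
e = []
for i in range(1, 5):
    # tree is the parsed lxml document; tr[2] pins the lookup to the second row, i picks the column
    contentB = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[{i}]".format(i=i))[0].text_content().strip()
    # labels start with a capital letter and stay as text; everything else is parsed as a number
    if re.match(r'[A-Z]', contentB) is None:
        contentB = int(contentB.replace(',', ''))

    e.append(contentB)
print(e)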

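2 Answers:

Answer 0 (score: 1)

If I understand your requirement correctly, you only need to replace tr[2] with tr.

The predicate [2] restricts the match to the second tr element; removing it lifts that restriction, so every tr is matched.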
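To make the effect of the predicate concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; positional predicates such as tr[2] select the same elements in lxml's xpath(). The sample table is hypothetical and is not the HTML from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample table (not the page from the question).
SAMPLE = """
<table class="demo">
  <tr><td>Total Revenue</td><td>31,821,000</td></tr>
  <tr><td>Cost of Revenue</td><td>16,447,000</td></tr>
  <tr><td>Gross Profit</td><td>15,374,000</td></tr>
</table>
""".strip()

table = ET.fromstring(SAMPLE)

# With the predicate: only the first cell of the second row is selected.
second_row_only = [td.text for td in table.findall("./tr[2]/td[1]")]

# Without the predicate: the first cell of every row is selected.
all_rows = [td.text for td in table.findall("./tr/td[1]")]

print(second_row_only)  # ['Cost of Revenue']
print(all_rows)         # ['Total Revenue', 'Cost of Revenue', 'Gross Profit']
```

Note that XPath positions start at 1, so tr[2] is the second row, not the third.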

EDITED

To extract the text content of the table cells, you could modify your code like this:

for i in range(1,5):
    # list of cells in column i of table
    collist = tree.xpath("//table[@class='yfnc_tabledata1']//table//tr/td[{i}]".format(i=i))
    contentB = [c.text_content().strip() for c in collist]
    # here contentB will be a list where each element is the text of one of the cells 
    # in column i of the table

    ##continue processing per your desired result... 
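One possible way to fill in the "continue processing" step is sketched below. The helper to_value is a hypothetical name, and its check is deliberately stricter than the [A-Z] test in the question: only comma-grouped or plain digit strings are converted, so a cell such as "-" stays a string instead of raising ValueError. The cell texts are made-up stand-ins for what text_content().strip() might return for one column.

```python
import re


def to_value(text):
    """Return an int for digit strings (with optional thousands commas), else the stripped text."""
    text = text.strip()
    if re.fullmatch(r"\d{1,3}(,\d{3})*|\d+", text):
        return int(text.replace(",", ""))
    return text


# Hypothetical cell texts for one column; the empty string stands for a blank cell.
contentB = ["Total Revenue", "31,821,000", " 16,447,000 ", "-", ""]

e = []
# Skip blank cells, convert the rest, and append them to the running list.
e.extend(to_value(t) for t in contentB if t.strip())

print(e)  # ['Total Revenue', 31821000, 16447000, '-']
```

Inside the loop over columns, the same e.extend(...) call would be applied to each contentB list in turn.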

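Answer 1 (score: 0)

I am not sure whether the previous snippet answered your question. If not, here is my solution. Note the additional "tbody" elements, which were not included in the original XPath.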

import re

import lxml.html  # a bare "import lxml" does not load the html submodule

tree = lxml.html.parse("stack-tmp.html")
e = []
rows = tree.xpath('//table[@class="yfnc_tabledata1"]/tbody/tr[1]/td/table/tbody/tr')
for row in rows:
    for td in row.xpath('./td'):
        thistext = td.text_content().strip()
        if not thistext:
            continue  # skip empty cells
        if re.match(r'[A-Z]', thistext) is None:
            try:
                e.append(int(thistext.replace(',', '')))
            except ValueError:
                pass  # not a plain integer; skip it, as the original bare except did
        else:
            e.append(thistext)

print(e)

This extracts the following:

['Period Ending', 
'Total Revenue', 31821000, 30871000, 29904000, 
'Cost of Revenue', 16447000, 16106000, 15685000,
'Gross Profit', 15374000, 14765000, 14219000,
'Operating Expenses',
'Research Development', 1770000, 1715000, 1634000,
'Selling General and Administrative', 6469000, 6384000, 6102000,
'Non Recurring',
'Others',
'Total Operating Expenses',
'Operating Income or Loss', 7135000, 6666000, 6483000,
'Income from Continuing Operations', 
'Total Other Income/Expenses Net', 33000, 41000, 39000, 
'Earnings Before Interest And Taxes', 7168000, 6707000, 6522000, 
'Interest Expense', 142000, 145000, 171000, 
'Income Before Tax', 7026000, 6562000, 6351000, 
'Income Tax Expense', 2028000, 1841000, 1840000, 
'Minority Interest', 
'Net Income From Continuing Ops', 4956000, 4659000, 4444000, 
'Non-recurring Events', 
'Discontinued Operations', 
'Extraordinary Items', 
'Effect Of Accounting Changes', 
'Other Items', 
'Net Income', 4956000, 4659000, 4444000, 
'Preferred Stock And Other Adjustments', 
'Net Income Applicable To Common Shares', 4956000, 4659000, 4444000]
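
A note on the "tbody" steps in the XPath above: a child step such as /tbody/tr matches only if tbody elements are actually present in the parsed tree. Browser developer tools often display tbody elements that do not appear in the raw HTML, so it is worth checking what the parser really produced. The sketch below illustrates the difference with the standard library's xml.etree.ElementTree (used here instead of lxml so that it runs without extra packages) on two hypothetical one-cell tables; the helper name cell_texts is made up for this example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical tables that differ only in the presence of tbody.
WITH_TBODY = "<table><tbody><tr><td>Net Income</td></tr></tbody></table>"
WITHOUT_TBODY = "<table><tr><td>Net Income</td></tr></table>"


def cell_texts(markup, path):
    """Parse the markup and return the stripped text of every cell matched by path."""
    table = ET.fromstring(markup)
    return ["".join(td.itertext()).strip() for td in table.findall(path)]


# A child step through tbody matches only when tbody is really in the tree.
strict_with = cell_texts(WITH_TBODY, "./tbody/tr/td")
strict_without = cell_texts(WITHOUT_TBODY, "./tbody/tr/td")

# A descendant step (.//tr) matches in both cases.
loose_with = cell_texts(WITH_TBODY, ".//tr/td")
loose_without = cell_texts(WITHOUT_TBODY, ".//tr/td")

print(strict_with, strict_without)  # ['Net Income'] []
print(loose_with, loose_without)    # ['Net Income'] ['Net Income']
```

The descendant form is more tolerant, but with nested tables like the ones in the question it would also pick up rows of inner tables, so the explicit path may still be the better choice once the tree structure is known.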