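The contentB expression below, which uses tr[2], retrieves only a single tr element, whereas I want to loop over every tr element in the table and append the content of each td to the list e, then print(e).
The code I am using is below.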
for i in range(1, 5):
    contentB = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[{i}]".format(i=i))[0].text_content().strip()
    if re.match(r'[A-Z]', contentB) is None:
        contentB = int(contentB.replace(',', ''))
    e.append(contentB)
Answer 0 (score: 1)
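If I understand your requirement correctly, you only need to replace tr[2] with tr. The predicate [2] restricts the match to the second tr element; removing the predicate removes that restriction, so every tr is matched.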
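To make the effect of the predicate concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, and the table markup and cell values (a1, b1, ...) are made up for illustration. The positional predicate should behave the same way in the lxml XPath from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-row table; the cell values are placeholders.
MARKUP = """
<table>
  <tr><td>a1</td><td>a2</td></tr>
  <tr><td>b1</td><td>b2</td></tr>
  <tr><td>c1</td><td>c2</td></tr>
</table>
"""

table = ET.fromstring(MARKUP)

# tr[2] keeps only the second tr child of the table.
second_row_only = [td.text for td in table.findall("./tr[2]/td")]

# tr without a predicate matches every tr child.
all_rows = [td.text for td in table.findall("./tr/td")]

print(second_row_only)
print(all_rows)
```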
EDITED
To extract the text content of the table cells, you could modify the code to:
for i in range(1, 5):
    # list of cells in column i of the table
    collist = tree.xpath("//table[@class='yfnc_tabledata1']//table//tr/td[{i}]".format(i=i))
    contentB = [c.text_content().strip() for c in collist]
    # here contentB is a list in which each element is the text of one of the cells
    # in column i of the table
    # continue processing per your desired result...
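One possible way to continue the processing is sketched below, reusing the conversion rule from the question: text that starts with an uppercase letter is kept as a label, anything else is converted to an integer. The helper name parse_cell and the sample column are hypothetical. Returning unparseable text unchanged is an assumption of this sketch, not something the question specifies.

```python
import re


def parse_cell(text):
    """Return an int for cells like '31,821,000', otherwise the stripped text."""
    text = text.strip()
    # Same test as in the question: a leading uppercase letter marks a label.
    if re.match(r"[A-Z]", text) is None:
        try:
            return int(text.replace(",", ""))
        except ValueError:
            # Assumption: leave cells such as '-' or '' unchanged.
            return text
    return text


# Hypothetical column contents, standing in for contentB above.
column = ["Total Revenue", " 31,821,000 ", "-", ""]
e = [parse_cell(cell) for cell in column]
print(e)
```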
Answer 1 (score: 0)
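I am not sure whether the previous code snippet answered your question. If it did not, here is my solution. Note the additional tbody elements, which were not included in the original XPath.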
import re
import lxml.html  # "import lxml" alone does not load the html submodule

tree = lxml.html.parse("stack-tmp.html")
e = []
rows = tree.xpath('//table[@class="yfnc_tabledata1"]/tbody/tr[1]/td/table/tbody/tr')
for row in rows:
    for td in row.xpath('./td'):
        thistext = td.text_content().strip()
        if not thistext:
            continue  # skip empty cells
        if re.match(r'[A-Z]', thistext) is None:
            try:
                e.append(int(thistext.replace(',', '')))
            except ValueError:
                pass  # not a plain integer; skipped, as in the original
        else:
            e.append(thistext)
print(e)
This extracts the following:
['Period Ending',
'Total Revenue', 31821000, 30871000, 29904000,
'Cost of Revenue', 16447000, 16106000, 15685000,
'Gross Profit', 15374000, 14765000, 14219000,
'Operating Expenses',
'Research Development', 1770000, 1715000, 1634000,
'Selling General and Administrative', 6469000, 6384000, 6102000,
'Non Recurring',
'Others',
'Total Operating Expenses',
'Operating Income or Loss', 7135000, 6666000, 6483000,
'Income from Continuing Operations',
'Total Other Income/Expenses Net', 33000, 41000, 39000,
'Earnings Before Interest And Taxes', 7168000, 6707000, 6522000,
'Interest Expense', 142000, 145000, 171000,
'Income Before Tax', 7026000, 6562000, 6351000,
'Income Tax Expense', 2028000, 1841000, 1840000,
'Minority Interest',
'Net Income From Continuing Ops', 4956000, 4659000, 4444000,
'Non-recurring Events',
'Discontinued Operations',
'Extraordinary Items',
'Effect Of Accounting Changes',
'Other Items',
'Net Income', 4956000, 4659000, 4444000,
'Preferred Stock And Other Adjustments',
'Net Income Applicable To Common Shares', 4956000, 4659000, 4444000]
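On the tbody note in this answer: a tbody step in the path only matches if the parsed tree actually contains that element, so whether it belongs in the XPath depends on the document at hand. The sketch below illustrates this with the standard library's xml.etree.ElementTree and two made-up tables. The helper count_rows is hypothetical, and the same reasoning is assumed to carry over to the lxml paths above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one table with an explicit tbody, one without.
WITH_TBODY = "<table><tbody><tr><td>x</td></tr></tbody></table>"
WITHOUT_TBODY = "<table><tr><td>x</td></tr></table>"


def count_rows(markup, path):
    """Count the elements that path matches, relative to the table root."""
    return len(ET.fromstring(markup).findall(path))


for name, markup in [("with tbody", WITH_TBODY), ("without tbody", WITHOUT_TBODY)]:
    print(
        name,
        count_rows(markup, "./tr"),        # direct children only
        count_rows(markup, "./tbody/tr"),  # requires an explicit tbody
        count_rows(markup, ".//tr"),       # descendant search, matches either way
    )
```

A descendant step such as //tr, as used in the first answer, sidesteps the question at the cost of also matching rows of any nested tables.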