Cannot access 'td' tags beyond the first table row after extracting/decomposing some 'td' tags

Time: 2016-05-22 09:13:04

Tags: python python-3.x web-scraping beautifulsoup bs4

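In this sample table with two rows and four columns, the first two cells of each row link to PDF files, which are what I want to extract. The other two cells in each row link to ZIP files.

I know I could filter for the PDF files directly in the 'findAll' call, but this table is only a small part of the page, and the HTML of the whole page is very inconsistent (at least to me).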
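For reference, filtering for the PDF links directly could look like the minimal sketch below. It assumes the standard-library 'html.parser' backend and a cut-down copy of the sample table wrapped in a table element; the names `pdf_pattern` and `pdf_links` are illustrative only. The pattern is compiled with re.IGNORECASE because the sample mixes '.PDF' and '.pdf'.

```python
import re
from bs4 import BeautifulSoup

# Cut-down copy of the sample table (assumption: wrapped in <table>).
html = '''
<table>
<tr>
<td><a href="http://example.com/r1c1.PDF">Chapter 3</a></td>
<td><a href="http://example.com/r1c2.pdf">Chapter 3</a></td>
<td><a href="http://example.com/r1c3.zip">Protect...stems</a></td>
</tr>
<tr>
<td><a href="http://example.com/r2c1.PDF">Chapter 4</a></td>
<td><a href="http://example.com/r2c2.pdf">Chapter 4</a></td>
<td><a href="http://example.com/r2c3.zip">Busine...Part 1 </a></td>
</tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# Match hrefs ending in ".pdf" regardless of letter case.
pdf_pattern = re.compile(r'\.pdf$', re.IGNORECASE)

# Passing a compiled pattern as the href filter keeps only matching anchors;
# the tree itself is left untouched.
pdf_links = [a['href'] for a in soup.find_all('a', href=pdf_pattern)]

for link in pdf_links:
    print(link)
```

Because nothing is removed from the tree, this approach does not depend on how 'extract' or 'decompose' behave.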

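So I am thinking of removing the tags that do not contain PDF files. I don't understand the result.

When I use 'decompose' to remove the tags containing ZIP files, only the PDF files in the first row are accessible; the two in the second row are not printed. However, if I print the whole soup, the PDF files in the second row are still there. I just cannot reach them with findAll. Printing 'soup.contents' also gives me only the first row.

When I use 'extract' instead of 'decompose', again only the first row is accessible. With this method, however, the first ZIP file in the first row is also printed (even though I extracted it, so it should not be printed).

Here is my code: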

from bs4 import BeautifulSoup
import re

html = '''
<tr>
<td><a href="http://example.com/r1c1.PDF">Chapter 3</a></td>
<td><a href="http://example.com/r1c2.pdf">Chapter 3</a></td>
<td><a href="http://example.com/r1c3.zip">Protect...stems</a></td>
<td><a href="http://example.com/r1c4.zip">Protect...stems</a></td>
</tr>
<tr>
<td><a href="http://example.com/r2c1.PDF">Chapter 4</a></td>
<td><a href="http://example.com/r2c2.pdf">Chapter 4</a></td>
<td><a href="http://example.com/r2c3.zip">Busine...Part 1 </a></td>
<td><a href="http://example.com/r2c4.zip">Busine...Part 1 </a></td>
</tr>
'''

soup = BeautifulSoup(html, 'lxml')

for i in soup.findAll('td'):
    print(i.a['href'])                   # This prints all the links correctly

print()

for i in soup.findAll('td'):
    if re.match('.*zip', i.a['href']):
        i.extract()

for i in soup.findAll('td'):
    print(i.a['href'])                  # This prints only the first two PDF files

Here is the output (with 'decompose'):

# Here are all the links
http://example.com/r1c1.PDF
http://example.com/r1c2.pdf
http://example.com/r1c3.zip
http://example.com/r1c4.zip
http://example.com/r2c1.PDF
http://example.com/r2c2.pdf
http://example.com/r2c3.zip
http://example.com/r2c4.zip

# Only the first row gets printed
http://example.com/r1c1.PDF
http://example.com/r1c2.pdf

When I use 'extract', this is the last part of the output:

.
.
.

# The first row does get printed, but so does the first ZIP file in it
http://example.com/r1c1.PDF
http://example.com/r1c2.pdf
http://example.com/r1c3.zip

What am I missing or doing wrong?

The software I am using:

  • Python - 3.4.0
  • bs4 - 4.4.0
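For comparison, here is a minimal sketch of the same removal step. It assumes a recent Beautiful Soup release and the standard-library 'html.parser' backend, and it wraps the sample rows in a table element; `zip_cells` and `remaining` are illustrative names. It does not establish how bs4 4.4.0 with lxml behaves, so it is only a way to check whether the symptom is tied to a particular setup.

```python
import re
from bs4 import BeautifulSoup

# Sample rows from the question (assumption: wrapped in <table>).
html = '''
<table>
<tr>
<td><a href="http://example.com/r1c1.PDF">Chapter 3</a></td>
<td><a href="http://example.com/r1c2.pdf">Chapter 3</a></td>
<td><a href="http://example.com/r1c3.zip">Protect...stems</a></td>
<td><a href="http://example.com/r1c4.zip">Protect...stems</a></td>
</tr>
<tr>
<td><a href="http://example.com/r2c1.PDF">Chapter 4</a></td>
<td><a href="http://example.com/r2c2.pdf">Chapter 4</a></td>
<td><a href="http://example.com/r2c3.zip">Busine...Part 1 </a></td>
<td><a href="http://example.com/r2c4.zip">Busine...Part 1 </a></td>
</tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# Collect the cells to delete first, then delete them in a second pass.
zip_cells = [
    td for td in soup.find_all('td')
    if td.a is not None and re.search(r'\.zip$', td.a['href'], re.IGNORECASE)
]
for td in zip_cells:
    td.decompose()

# Search the tree again after the removal.
remaining = [td.a['href'] for td in soup.find_all('td')]
for link in remaining:
    print(link)
```

If this prints links from both rows in one environment but not in another, the difference in library or parser versions is a reasonable place to look.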

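Edit

Here is a screenshot of my output:

[Screenshot: output of the script]

The first output is the result of using the 'decompose' method; the second is from 'extract'.

I don't know if it matters, but I am not actually scraping this from the web. I downloaded the pages first and am scraping the local copies.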

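Edit 2:

I want to merge all the PDF files into two separate files (the study material and the practice manual, i.e. the first two columns). I plan to include that step in the script itself.

To make this easier to automate, I want to rename the files following a pattern such as:

[study/practice][module] [text of the 'td' tag].pdf

For example, '0' marks a file as part of the study material and '1' as part of the practice manual.

So, for "Chapter 1 Developments in the Business Environment" (which is in module 1):

'01 Chapter 1 ... .pdf' would be part of the study material

'11 Chapter 1 ... .pdf' would be part of the practice manual.

Chapter 4 of the study material (the first chapter in module 2) would be '02 Ch 4 ... .pdf'.

The pages on this site are inconsistent, and I thought my work would be easier if I removed all 'td' tags that contain links to ZIP files or only a non-breaking space ('&nbsp;'). That is when I ran into the problem of not being able to access all the 'td' elements.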
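The naming pattern described above can be sketched as a small helper. The function name `build_filename`, its parameters, and the set of characters stripped from the title are assumptions made for illustration; only the '[study/practice][module] [text].pdf' layout comes from the description.

```python
import re


def build_filename(kind, module, title):
    """Build '[kind][module] [title].pdf'.

    kind:   0 for study material, 1 for practice manual
    module: module number, e.g. 1 or 2
    title:  text of the 'td' cell
    """
    if kind not in (0, 1):
        raise ValueError("kind must be 0 (study) or 1 (practice)")

    # Collapse runs of whitespace (including non-breaking spaces) to one space.
    clean = re.sub(r'\s+', ' ', title).strip()

    # Drop characters that are commonly rejected in file names (assumption).
    clean = re.sub(r'[\\/:*?"<>|]', '', clean)

    return '{}{} {}.pdf'.format(kind, module, clean)


print(build_filename(0, 1, 'Chapter 1 Developments in the Business Environment'))
print(build_filename(1, 1, 'Chapter 1 Developments in the Business Environment'))
```

Keeping the prefix as the first two characters means a plain alphabetical sort groups the files by study or practice first and by module second, which suits the planned merge step.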

1 Answer:

Answer 0 (score: 0)

You can get all the PDFs and ZIPs from the second table inside the div with id=cpost:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.icai.org/post.html?post_id=10160").text
soup = BeautifulSoup(r, "lxml")

# get second table after cpost 
table = soup.select_one("#cpost").select_one("table:nth-of-type(2)")

# find all anchor tags whose href value ends with .pdf or .zip
pdfs = [a["href"] for a in table.select('td a[href$=".pdf"]')]
zips = [a["href"] for a in table.select('td a[href$=".zip"]')]
print(pdfs)
print(zips)

This gives you:

['http://220.227.161.86/28999sm_finalnew_cp-initialpages.pdf', 'http://220.227.161.86/31899sm_finalnew_vol2A_iniipages.pdf', 'http://220.227.161.86/18905sm_finalnew_cp1a.pdf', 'http://220.227.161.86/21520sm_finalnew_vol2_cp1.pdf', 'http://220.227.161.86/18854sm_finalnew_cp2.pdf', 'http://220.227.161.86/21521sm_finalnew_vol2_cp2.pdf', 'http://220.227.161.86/18855sm_finalnew_cp3.pdf', 'http://220.227.161.86/21522sm_finalnew_vol2_cp3.pdf', 'http://220.227.161.86/36605sm_finalnew_feedbackform-m1.pdf', 'http://220.227.161.86/36606sm_finalnew_cp-initialpages-m2.pdf', 'http://220.227.161.86/18856sm_finalnew_cp4.pdf', 'http://220.227.161.86/21523sm_finalnew_vol2_cp4.pdf', 'http://220.227.161.86/18857sm_finalnew_cp5.pdf', 'http://220.227.161.86/21524sm_finalnew_vol2_cp5.pdf', 'http://220.227.161.86/18858sm_finalnew_cp6.pdf', 'http://220.227.161.86/21525sm_finalnew_vol2_cp6.pdf', 'http://220.227.161.86/18859sm_finalnew_cp7.pdf', 'http://220.227.161.86/21526sm_finalnew_vol2_cp7.pdf', 'http://220.227.161.86/18860sm_finalnew_cp8.pdf', 'http://220.227.161.86/21527sm_finalnew_vol2_cp8.pdf', 'http://220.227.161.86/18861sm_finalnew_cp9.pdf', 'http://220.227.161.86/21528sm_finalnew_vol2_cp9.pdf', 'http://220.227.161.86/31901sm_finalnew_cp-feedbackformvolA.pdf', 'http://220.227.161.86/36607sm_finalnew_cp-initialpages-m3.pdf', 'http://220.227.161.86/31900sm_finalnew_vol2B_iniipages.pdf', 'http://220.227.161.86/18862sm_finalnew_cp10.pdf', 'http://220.227.161.86/21529sm_finalnew_vol2_cp10.pdf', 'http://220.227.161.86/18970sm_finalnew_cp11.pdf', 'http://220.227.161.86/21530sm_finalnew_vol2_cp11.pdf', 'http://220.227.161.86/18971sm_finalnew_cp12.pdf', 'http://220.227.161.86/21531sm_finalnew_vol2_cp12.pdf', 'http://220.227.161.86/18863sm_finalnew_cp13.pdf', 'http://220.227.161.86/21532sm_finalnew_vol2_cp13.pdf', 'http://220.227.161.86/18972sm_finalnew_cp14.pdf', 'http://220.227.161.86/21533sm_finalnew_vol2_cp14.pdf', 'http://220.227.161.86/18864sm_finalnew_cp15.pdf', 
'http://220.227.161.86/21534sm_finalnew_vol2_cp15.pdf', 'http://220.227.161.86/18865sm_finalnew_cp16.pdf', 'http://220.227.161.86/21535sm_finalnew_vol2_cp16.pdf', 'http://220.227.161.86/29001sm_finalnew_cp-appendix.pdf', 'http://220.227.161.86/31903sm_finalnew_cp-appendix-pmvolab.pdf', 'http://220.227.161.86/29000sm_finalnew_cp-feedbackform.pdf', 'http://220.227.161.86/31902sm_finalnew_cp-feedbackformvolB.pdf']
['http://www.mediafire.com/file/7g7zhhlzmd2u49u/P5Ch1DevelopmentsBusinessEnvironmenP1.zip', 'http://www.mediafire.com/file/tx0dhvlkx9w518s/P5Ch1DevelopmentSBusinessEnvironmentP1.zip', 'http://www.mediafire.com/file/751095mme1hd5z7/P5Ch1DevelopmentsBusinessEnvironmentP2.zip', 'http://www.mediafire.com/file/ifn2nn57czr5djm/P5Ch1DevelopmentsBusinessEnvironmentP2.zip', 'http://www.mediafire.com/file/trf1adc5a1gr4at/P5Ch1DevelopmentsBusinessEnvironmentP3.zip', 'http://www.mediafire.com/file/ofjv0pu00v35ivc/P5Ch1DevelopmentsBusinessEnvironmentP4.zip', 'http://www.mediafire.com/file/us454ahc99llili/P5Ch1DevelopmentsBusinessEnvironmentP4.zip', 'http://www.mediafire.com/file/etps8fn6y26qyyn/P5Ch1DevelopmentsBusinessEnvironmentP5.zip', 'http://www.mediafire.com/file/659coi5bhg8ku50/P5Ch1DevelopmentsBusinessEnvironmentP5.zip', 'http://www.mediafire.com/file/2fhzegia69op9ao/FP5Ch2DecisionMakingAndCVPAnalysisPart1.zip', 'http://www.mediafire.com/file/rrbciytpktmh121/FP5Ch2DecisionMakingAndCVPAnalysisPart1.zip', 'http://www.mediafire.com/file/ivtvwvknl5w7bc7/FP5Ch2DecisionMakingAndCVPAnalysisPart2.zip', 'http://www.mediafire.com/file/ba87eg665dtwaui/FP5Ch2DecisionMakingAndCVPAnalysisPart2.zip', 'http://www.mediafire.com/file/lyzb1yic8l7alst/FP5Ch2DecisionMakingAndCVPAnalysisPart3.zip', 'http://www.mediafire.com/file/8iln8yigbzrla3k/FP5Ch2DecisionMakingAndCVPAnalysisPart3.zip', 'http://www.mediafire.com/file/7aafaoqgsq5dg6u/P5Ch3PricingDecisions.zip', 'http://www.mediafire.com/file/fbmdhlqm8ey3aue/P5Ch3PricingDecisions.zip', 'http://www.mediafire.com/file/0ixhgb0x07qu1an/P5Ch4BudgetP1.zip', 'http://www.mediafire.com/file/wqlk53ol26j4lmm/P5Ch4BudgetP1.zip', 'http://www.mediafire.com/file/b9dsw8sudsud7eg/P5Ch4BudgetP2.zip', 'http://www.mediafire.com/file/mrbw44z2mru2tnw/P5Ch4BudgetP2.zip', 'http://www.mediafire.com/file/hfndpfthdfm7l7s/P5Ch4Budget_3.zip', 'http://www.mediafire.com/file/03yzz48no2ttbta/P5Ch4BudgetP3.zip', 
'http://www.mediafire.com/file/usxz01xiuw6rgaj/P5Ch4Budget4.zip', 'http://www.mediafire.com/file/qoyjgu61luvsd71/P5Ch4BudgetP4.zip', 'http://www.mediafire.com/file/m3h6qm2gevdmx4n/P5Ch5StandardCosting.zip', 'http://www.mediafire.com/file/wdj8nsl7to2u6cz/P5Ch5StandardCostingPodcast.zip', 'http://www.mediafire.com/file/x9ypv4oxx573ea4/P5Ch5StandardCostingPart1.zip', 'http://www.mediafire.com/file/ax1taa1ahaga2w3/P5Ch5StandardCostingPart1.zip', 'http://www.mediafire.com/file/z0z0pxp5l44qnz4/P5Ch5StandardCostingPart2.zip', 'http://www.mediafire.com/file/h8stxmsqcdqum9q/P5Ch5StandardCostingPart2.zip', 'http://www.mediafire.com/file/v33bcec4ep6ulxc/P5Ch6CostingofServiceSector.zip', 'http://www.mediafire.com/file/i9y7gebccovv852/P5Ch6CostingofServiceSector.zip', 'http://www.mediafire.com/file/bd61s151fgk1xbs/P5Ch7TransferPricing.zip', 'http://www.mediafire.com/file/iwf47hp8bb6wtdq/P5Ch7TransferPricing.zip', 'http://www.mediafire.com/file/ihn7dz4okrvl8c6/P5Ch8UniformCostingAndIFC.zip', 'http://www.mediafire.com/file/44nqn834616kqvb/P5Ch8UniformCostingAndIFC.zip', 'http://www.mediafire.com/file/hfyr8y6w18klua8/P5Ch9CostsheetPAAndReporting.zip', 'http://www.mediafire.com/file/dnxlsmifismp269/P5Ch9CostSheetPAAndReporting.zip', 'http://www.mediafire.com/file/ef27t3y4ly7zsqa/P5Ch11LINEARPROGRAMMINGPart2.zip', 'http://www.mediafire.com/file/66zwcr87sxu8r08/P5Ch11LINEARPROGRAMMINGV5R.zip', 'http://www.mediafire.com/file/03340dc0g94t3r9/P5Ch11LINEARPROGRAMMINGV5R.zip', 'http://www.mediafire.com/file/i3dcrrbcfcy7xqh/P5Ch11LINEARPROGRAMMINGPart2.zip', 'http://www.mediafire.com/file/j870no55rq8qu55/P5Ch11Transportation_rev.zip', 'http://www.mediafire.com/file/vm75wz1j93s284c/P5Ch11Transportation.zip', 'http://www.mediafire.com/file/0qibqc72f2zdf6g/P5Ch12AssignmentProblem.zip', 'http://www.mediafire.com/file/7iqrbl2ze6wh7mp/P5Ch12AssignmentProblem.zip', 'http://www.mediafire.com/file/nds6vcp6b7ws36h/P5Ch13CriticalpathAnalysis.zip', 
'http://www.mediafire.com/file/wurwuaspw9ne954/P5Ch13CriticalpathAnalysis.zip', 'http://www.mediafire.com/file/stxrr65qth1jmi3/P5Ch14PERT.zip', 'http://www.mediafire.com/file/mzlg290e1gy3ce5/P5Ch14PERT.zip', 'http://www.mediafire.com/file/jhad7r6qxexkvug/P5Ch15Simulation.zip', 'http://www.mediafire.com/file/af2r0p99m36uvd4/P5C15SimulationV2.zip', 'http://www.mediafire.com/file/vl59gzd8szy83rw/P5Ch16LearningCurveTheory.zip', 'http://www.mediafire.com/file/439dd5eec4c6u2i/P5Ch16LearningCurveTheory.zip']

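This exactly matches what you see on the web page.

If you only want the links from the first two columns, together with the td text to use for renaming: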

# get second table after cpost
table = soup.select_one("#cpost").select_one("table:nth-of-type(2)")
rows = table.select("tr")

for row in rows:
    td1 = row.select_one("td:nth-of-type(1)")
    td2 = row.select_one("td:nth-of-type(2)")
    l1, l2 = td1.select_one('a[href$=".zip"]'), td2.select_one('a[href$=".pdf"]')
    if l1:
        print("Found zip {}".format(l1["href"]))
        print(td1.text)
    if l2:
        print("Found pdf {}".format(l2["href"]))
        print(td2.text)
    print()

This gives you:

Found pdf http://220.227.161.86/31899sm_finalnew_vol2A_iniipages.pdf
Initial Pages 

Found pdf http://220.227.161.86/21520sm_finalnew_vol2_cp1.pdf
Chapter 1 Developments in the Business Environment

Found zip http://www.mediafire.com/file/751095mme1hd5z7/P5Ch1DevelopmentsBusinessEnvironmentP2.zip
Developments in the Business Environment Part 2 


Found zip http://www.mediafire.com/file/ofjv0pu00v35ivc/P5Ch1DevelopmentsBusinessEnvironmentP4.zip
Developments in the Business Environment Part 4 

Found zip http://www.mediafire.com/file/etps8fn6y26qyyn/P5Ch1DevelopmentsBusinessEnvironmentP5.zip
Developments in the Business Environment Part 5 

Found pdf http://220.227.161.86/21521sm_finalnew_vol2_cp2.pdf
Chapter 2 Decision Making using Cost Concepts and CVP Analysis 

Found zip http://www.mediafire.com/file/ivtvwvknl5w7bc7/FP5Ch2DecisionMakingAndCVPAnalysisPart2.zip
Decision Making using Cost Concepts and CVP Analysis part2

Found zip http://www.mediafire.com/file/lyzb1yic8l7alst/FP5Ch2DecisionMakingAndCVPAnalysisPart3.zip
Decision Making using Cost Concepts and CVP Analysis part3

Found pdf http://220.227.161.86/21522sm_finalnew_vol2_cp3.pdf
Chapter 3 Pricing Decisions 




Found pdf http://220.227.161.86/21523sm_finalnew_vol2_cp4.pdf
Chapter 4 Budget & Budgetary Control 

Found zip http://www.mediafire.com/file/b9dsw8sudsud7eg/P5Ch4BudgetP2.zip
Budget and Budgetary Control Part 2 

Found zip http://www.mediafire.com/file/hfndpfthdfm7l7s/P5Ch4Budget_3.zip
Budget and Budgetary Control Part 3 

Found zip http://www.mediafire.com/file/usxz01xiuw6rgaj/P5Ch4Budget4.zip
Budget and Budgetary Control Part 4 

Found pdf http://220.227.161.86/21524sm_finalnew_vol2_cp5.pdf
Chapter 5 Standard Costing 

Found zip http://www.mediafire.com/file/x9ypv4oxx573ea4/P5Ch5StandardCostingPart1.zip
Standard Costing - Part 1 

Found zip http://www.mediafire.com/file/z0z0pxp5l44qnz4/P5Ch5StandardCostingPart2.zip
Standard Costing - Part 2 

Found pdf http://220.227.161.86/21525sm_finalnew_vol2_cp6.pdf
Chapter 6 Costing of Service Sector 

Found pdf http://220.227.161.86/21526sm_finalnew_vol2_cp7.pdf
Chapter 7 Transfer Pricing 

Found pdf http://220.227.161.86/21527sm_finalnew_vol2_cp8.pdf
Chapter 8 Uniform Costing & Inter-firm Comparison 

Found pdf http://220.227.161.86/21528sm_finalnew_vol2_cp9.pdf
Chapter 9 Cost Sheet, Profitability Analysis and Reporting 

Found pdf http://220.227.161.86/31901sm_finalnew_cp-feedbackformvolA.pdf
Feedback Form 



Found pdf http://220.227.161.86/31900sm_finalnew_vol2B_iniipages.pdf
Initial Pages 

Found pdf http://220.227.161.86/21529sm_finalnew_vol2_cp10.pdf
Chapter 10 Linear Programming 

Found zip http://www.mediafire.com/file/03340dc0g94t3r9/P5Ch11LINEARPROGRAMMINGV5R.zip
Linear Programming – Part 2 

Found pdf http://220.227.161.86/21530sm_finalnew_vol2_cp11.pdf
Chapter 11 The Transportation Problem 

Found pdf http://220.227.161.86/21531sm_finalnew_vol2_cp12.pdf
Chapter 12 The Assignment Problem 

Found pdf http://220.227.161.86/21532sm_finalnew_vol2_cp13.pdf
Chapter 13 Critical Path Analysis 

Found pdf http://220.227.161.86/21533sm_finalnew_vol2_cp14.pdf
Chapter 14 Program Evaluation and Review Technique 

Found pdf http://220.227.161.86/21534sm_finalnew_vol2_cp15.pdf
Chapter 15 Simulation 

Found pdf http://220.227.161.86/21535sm_finalnew_vol2_cp16.pdf
Chapter 16 Learning Curve Theory 

Found pdf http://220.227.161.86/31903sm_finalnew_cp-appendix-pmvolab.pdf
Appendix

Found pdf http://220.227.161.86/31902sm_finalnew_cp-feedbackformvolB.pdf
Feedback Form
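
As a variation on the row-by-row loop, the sketch below collects only the PDF links from the first two columns and keeps the column index alongside the cell text. It runs against the two-row sample from the question rather than the live page, assumes the 'html.parser' backend, and uses illustrative names (`records`, `pdf_pattern`). Matching is case-insensitive because the sample contains both '.PDF' and '.pdf'.

```python
import re
from bs4 import BeautifulSoup

# Two-row sample from the question (assumption: wrapped in <table>).
html = '''
<table>
<tr>
<td><a href="http://example.com/r1c1.PDF">Chapter 3</a></td>
<td><a href="http://example.com/r1c2.pdf">Chapter 3</a></td>
<td><a href="http://example.com/r1c3.zip">Protect...stems</a></td>
<td><a href="http://example.com/r1c4.zip">Protect...stems</a></td>
</tr>
<tr>
<td><a href="http://example.com/r2c1.PDF">Chapter 4</a></td>
<td><a href="http://example.com/r2c2.pdf">Chapter 4</a></td>
<td><a href="http://example.com/r2c3.zip">Busine...Part 1 </a></td>
<td><a href="http://example.com/r2c4.zip">Busine...Part 1 </a></td>
</tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
pdf_pattern = re.compile(r'\.pdf$', re.IGNORECASE)

records = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    # Only the first two columns are of interest:
    # index 0 = study material, index 1 = practice manual.
    for column, td in enumerate(cells[:2]):
        link = td.find('a', href=pdf_pattern)
        if link is not None:
            records.append((column, td.get_text(strip=True), link['href']))

for column, text, href in records:
    print(column, text, href)
```

Each record carries what the renaming scheme needs: the column index for the study or practice prefix and the cell text for the title.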