Python: Reading the contents of a hidden HTML table

Time: 2014-11-21 15:00:38

Tags: python beautifulsoup

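This web page has a "Show Study Locations" tab. When I click the tab, it shows the entire list of locations and changes the URL to the one I use in this program. When I run the program to print the entire list of locations, I get this result: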

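# Python 2; assumes BeautifulSoup 4 is installed as bs4
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('https://clinicaltrials.gov/ct2/show/study/NCT01718158?term=NCT01718158&rank=1&show_locs=Y#locn').read())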

for row in soup('table')[5].findAll('tr'):
    tds = row('td')
    if len(tds)<2:
        continue
    print tds[0].string, tds[1].string  #, '\n'.join(filter(unicode.strip, tds[1].strings))

Local Institution None
Local Institution None
Local Institution None
Local Institution None
Local Institution None
and so on, leaving out the rest of the information. I feel I am missing something here. My result should be:

United States, California
Va Long Beach Healthcare System 
Long Beach, California, United States, 90822
United States, Georgia
Gastrointestinal Specialists Of Georgia Pc  
Marietta, Georgia, United States, 30060
United States, New York
Weill Cornell Medical College   

and so on. I want to print the entire list of locations.

1 answer:

Answer 0 (score: 0)

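The local institutions have only one table cell, but you are skipping those rows.

Perhaps you need to extract data from all cells and skip only the rows that have no <td> cells: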

for row in soup('table')[5].findAll('tr'):
    tds = row('td')
    if not tds:
        continue
    print u' '.join([cell.string for cell in tds if cell.string])

This produces

United States, California
Va Long Beach Healthcare System
Long Beach, California, United States, 90822  
United States, Georgia
Gastrointestinal Specialists Of Georgia Pc
Marietta, Georgia, United States, 30060  
# .... 
Local Institution
Taipei, Taiwan, 100  
Local Institution
Taoyuan, Taiwan, 333  
United Kingdom
Local Institution
London, Greater London, United Kingdom, SE5 9RS
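
The `None` values in the question's output are consistent with how BeautifulSoup defines `.string`: it returns `None` when a tag has more than one child, which would happen if the second cell mixes text with nested tags. Below is a minimal sketch of the same row loop that uses `get_text()` instead, so such cells still yield their text. It assumes Python 3 and BeautifulSoup 4 with the standard-library `html.parser`. The `SAMPLE_HTML` markup and the `row_texts` helper are hypothetical, made up to mimic the shape of the locations table rather than copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the locations table; not the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Locations</th></tr>
  <tr><td>United States, California</td></tr>
  <tr><td>Va Long Beach Healthcare System</td></tr>
  <tr><td>Long Beach, California, United States, 90822</td></tr>
  <tr><td>Local Institution</td><td><span>Taipei</span>, <span>Taiwan</span></td></tr>
</table>
"""


def row_texts(html):
    """Return one text line per table row that has at least one <td>."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            # Header rows contain only <th> cells, so skip them.
            continue
        # get_text() gathers text from nested tags; split/join collapses whitespace.
        parts = [" ".join(cell.get_text().split()) for cell in cells]
        lines.append(" ".join(part for part in parts if part))
    return lines


if __name__ == "__main__":
    for line in row_texts(SAMPLE_HTML):
        print(line)
```

Compared with filtering on `cell.string`, this variant keeps cells that contain nested markup instead of silently dropping them. Whether that matters for the real page depends on its actual HTML.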