Question

http://www.wfri.re.kr/client/PublishHp.do?command=view&list_dis_txt=PUB&current_page=1&isu_year=all&list_unq_no=RP00000001847&search_category=&search_keyword=&pub_dt=20170203&topMenuNo=H20000&leftMenuNo=H20100

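I am scraping this site.

I am using Python 3 and BeautifulSoup.

My scraper cannot find any tags on this page.

I want to download the PDF file from it.

BeautifulSoup does not return any of the tags I search for on this site.

Why?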

import requests
from bs4 import BeautifulSoup

def second_crawler(second_url):
    # Download the page and parse it with the lxml parser.
    source_code = requests.get(second_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    print(soup)  # for debug
    # td_tags = soup.find_all('td', class_='view_cont')
    # print(len(td_tags))    ## result is 0. Why??

second_crawler('http://www.wfri.re.kr/client/PublishHp.do?command=view&list_dis_txt=PUB&current_page=1&isu_year=all&list_unq_no=RP00000001847&search_category=&search_keyword=&pub_dt=20170203&topMenuNo=H20000&leftMenuNo=H20100')

Answer 1

The site serves its PDF downloads through a JavaScript function, javascript:fnc_filedown(), rather than through a plain URL.

For example, when I open one of the posts: http://www.wfri.re.kr/client/PublishHp.do?command=view&list_dis_txt=PUB&current_page=1&isu_year=all&list_unq_no=RP00000001847&search_category=&search_keyword=&pub_dt=20170203&topMenuNo=H20000&leftMenuNo=H20100

the download can only be triggered by the following call:

javascript:fnc_filedown( 'XXX.pdf', '148636884482283162132' );

because that is what the link's href contains:

<a href = "javascript:fnc_filedown( 'XXX.pdf', '148636884482283162132' );" class="link01">XXX.pdf</a>

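I suggest adapting your scraper to the way this site is built: look for these javascript: links instead of direct PDF URLs.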
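As a rough illustration of that suggestion, the sketch below pulls the two arguments of each fnc_filedown(...) call out of the page text with a regular expression. It runs on the sample link quoted above rather than on the live page. The helper name extract_filedown_args and the labels file_name and file_id are my own assumptions; the answer does not say what the second argument means or which request fnc_filedown() finally sends, so you would still have to read the site's JavaScript to build the actual download request.

```python
import re

# Matches fnc_filedown( 'name', 'id' ) and captures both quoted arguments.
# The meaning of the second argument is an assumption (it looks like a file id).
FILEDOWN_RE = re.compile(r"fnc_filedown\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")


def extract_filedown_args(html):
    """Return a list of (file_name, file_id) tuples found in the HTML text."""
    return FILEDOWN_RE.findall(html)


# Sample markup copied from the answer above.
sample = (
    "<a href = \"javascript:fnc_filedown( 'XXX.pdf', "
    "'148636884482283162132' );\" class=\"link01\">XXX.pdf</a>"
)

for file_name, file_id in extract_filedown_args(sample):
    print(file_name, file_id)
```

In the crawler from the question, the same helper could be called on plain_text (the response body) instead of sample.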
