Question

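When I use selenium + phantomjs to fetch a URL that contains non-ASCII characters, the page source I get back is not the page I wanted. I percent-encode the URL before passing it to phantomjs.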

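# -*- coding: utf-8 -*-
# Python 2 script
import urllib

from selenium import webdriver


def phantomTianyanchaSearch(name):
    by = webdriver.PhantomJS()
    # Percent-encode the search term before building the URL
    url = 'http://www.tianyancha.com/search/' + urllib.quote_plus(name) + '?checkFrom=searchBox'
    by.get(url)
    # Save the page source that PhantomJS actually loaded
    f = open('logs/phantom_tianyancha'+name+'.html', 'w')
    page = by.page_source.encode('utf-8')
    f.write(page)
    f.close()
    # Compare the requested URL with the URL the browser ended up on
    print url, '\tgot', len(page)
    print by.current_url
    by.quit()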

if __name__ == '__main__':
    # phantomYaozh()
    # chromeTianyancha('1218773262')
    companyids = ['1218773262', '719792175', '2347856627', '2342953041']
    # phantomTianyanchas(companyids)
    phantomTianyanchaSearch('汕头金石制药总厂')

Here is the printed output:

http://www.tianyancha.com/search/%E6%B1%95%E5%A4%B4%E9%87%91%E7%9F%B3%E5%88%B6%E8%8D%AF%E6%80%BB%E5%8E%82?checkFrom=searchBox   got 51294
http://www.tianyancha.com/?from=i

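So the problem is that phantomjs does not get the correct page. It seems that phantomjs cannot handle the Chinese characters, so it ends up searching for nothing. How can I fix this?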
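One way to narrow this down is to check the encoding step on its own, without a browser. The sketch below rebuilds the search URL and compares it with the first URL in the printed output. The helper name `build_search_url` is hypothetical, and the import fallback is only there so the snippet runs on both Python 2 and Python 3. If the two URLs match, the percent-encoding is probably not the culprit, which would point to the redirect to `http://www.tianyancha.com/?from=i` happening after the request is made; it does not prove where the redirect comes from.

```python
# -*- coding: utf-8 -*-
# Sketch: rebuild the search URL without a browser and compare it to the logged one.
try:
    from urllib import quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3


def build_search_url(name):
    """Percent-encode the UTF-8 bytes of `name` (hypothetical helper, not from the post)."""
    encoded = quote_plus(name.encode('utf-8'))
    return 'http://www.tianyancha.com/search/' + encoded + '?checkFrom=searchBox'


# First URL from the printed output in the question
logged_url = ('http://www.tianyancha.com/search/'
              '%E6%B1%95%E5%A4%B4%E9%87%91%E7%9F%B3%E5%88%B6%E8%8D%AF%E6%80%BB%E5%8E%82'
              '?checkFrom=searchBox')

url = build_search_url(u'汕头金石制药总厂')
matches = (url == logged_url)
print(matches)
```

If `matches` is true, the URL handed to `by.get()` is already a plain ASCII string, so comparing `by.current_url` against it after the call (as the script above already does) is the more informative check.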

selenium webdriver phantomjs url with non-ascii character

0 answers: