Crawling links with no-follow tag and dynamically generated URLs - using urllib3

Date: 2017-08-04 13:03:48

Tags: python-3.x web-scraping beautifulsoup urllib3

Experts,

I am trying to crawl a web site and download files from it, if any are present, using urllib3 and BeautifulSoup. I am able to crawl the domain up to the page where the link for downloading the file exists. However, when I then request the URL with urllib3 to download the linked file, as follows,

import urllib3

http = urllib3.PoolManager()
page = http.request('GET', URL)

I end up getting only the URL string itself and not the linked file. Upon inspecting the page, I found that the URL is marked <a rel="no-follow" href="URL">. And when I try to download the file linked to that URL, I get a 403 Forbidden response, as follows:

<HTML>
<HEAD>
<TITLE>403 Forbidden</TITLE>
<BASE href="/error_docs/"><!--[if lte IE 6]></BASE><![endif]-->
</HEAD>
<BODY>
<H1>Forbidden</H1>
You do not have permission to access this document.
<P>
<HR>
<ADDRESS>
Web Server at kkssraiadmk.com
</ADDRESS>
</BODY>
</HTML>

<!--
- Unfortunately, Microsoft has added a clever new
- "feature" to Internet Explorer. If the text of
- an error's message is "too small", specifically
- less than 512 bytes, Internet Explorer returns
- its own error message. You can turn that off,
- but it's pretty tricky to find switch called
- "smart error messages". That means, of course,
- that short error messages are censored by default.
- IIS always returns error messages that are long
- enough to make Internet Explorer happy. The
- workaround is pretty simple: pad the error
- message with a big comment like this to push it
- over the five hundred and twelve bytes minimum.
- Of course, that's exactly what you're reading
- right now.
-->

However, when I click the corresponding link in the inspector console, I am able to view the file in the web browser.
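Since the browser can open the link but urllib3 cannot, the server may be checking headers the browser sends automatically, such as Referer or cookies. A minimal sketch of what mirroring those headers could look like (the URLs here are placeholders, not the real site's):

```python
import urllib3

# Placeholder URL -- substitute the real page that contains the link.
PAGE_URL = "http://example.com/downloads"

# Browser-like headers; a missing Referer is a common cause of 403 responses.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Referer": PAGE_URL,
    "Accept": "*/*",
}

def fetch(url, http=None):
    """GET `url` with browser-like headers and return the urllib3 response."""
    http = http or urllib3.PoolManager()
    return http.request("GET", url, headers=HEADERS)

# Network usage would be: resp = fetch(PAGE_URL); print(resp.status)
```

Whether this works depends on what the server actually checks; it may also require cookies from a prior page load.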

Another thing is that the URL string is different on each refresh, as observed through "inspect element". As far as my understanding goes, the URL is being generated dynamically on every request.
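If the URL really is regenerated on every request, a URL copied from an earlier page load will be stale; the link has to be parsed out of the same response that was just fetched and followed immediately. A sketch of that extraction step with BeautifulSoup (the sample markup below is a stand-in, not the real page):

```python
from bs4 import BeautifulSoup

def extract_download_href(page_html):
    """Return the href of the first <a rel="no-follow"> anchor, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    link = soup.find("a", rel="no-follow")
    return link["href"] if link is not None else None

# Stand-in markup; the real page's structure will differ.
sample = '<html><body><a rel="no-follow" href="/abc123/file/1840">get</a></body></html>'
```

The extracted href would then be requested with the same PoolManager, in the same session, before it expires.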

My questions are as follows:

Q.1: Is it possible to follow a URL that has been marked "no-follow" using urllib3 and parse the contents or download the file? If so, how can that be done? I've tried supplying a user-agent, as follows,

user_agent = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}

but that doesn't solve the problem, I think because of the dynamically generated URL strings.

Q.2: Even if I can follow the "no-follow" link, how do I fetch the file linked to the URL string

/39f6f7c6abfdf34203421d37729bd20409508bd087bcdf96fa3cb7f6/doeE7UiXssBTML8DSEzmtU8MiFGpNRqg1F/1840
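Once a working URL is obtained, the file itself can be streamed to disk rather than held in memory, which urllib3 supports via preload_content=False. A sketch, where the file URL and output filename are placeholders:

```python
import urllib3

def save_stream(resp, path, chunk_size=1024):
    """Stream a urllib3 response body to `path`; return bytes written."""
    written = 0
    with open(path, "wb") as fh:
        for chunk in resp.stream(chunk_size):
            fh.write(chunk)
            written += len(chunk)
    resp.release_conn()  # hand the connection back to the pool
    return written

# Network usage would be:
#   http = urllib3.PoolManager()
#   resp = http.request("GET", file_url, preload_content=False)
#   save_stream(resp, "downloaded.bin")
```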

Any insight into how to go about downloading the files would be much appreciated.

Thanks in advance.

0 Answers:

No answers yet