Web scraping: recursively following links from hrefs and storing the traversed data

Time: 2019-04-17 07:55:09

Tags: python html python-3.x web-scraping beautifulsoup

So I am building a web scraper that visits a website and finds all the divs that contain li elements. It then iterates over the lis and stores their text in a text file.

Now, the site also has links embedded in some places.

So I need to follow those links, find the li elements inside them, and then return to the parent page.

My code is shared below:

import urllib
import urllib.request
from bs4 import BeautifulSoup


def writeToFile(ul):
    for li in ul:
        with open('path/to/file.txt', 'a+') as f:
            text = li.text
            f.write(text + ',')
            f.close()


def searchElements(url):
    print(url)
    response = urllib.request.urlopen(url)
    html = response.read()

    soup = BeautifulSoup(html, 'html.parser')

    divs = soup.findAll('div', id=lambda x: x and x.startswith('mntl-sc-block_1-0-'))
    for div in divs:
        ul = div.find("ul")
        if ul is not None:
            ulVariable = ul.findAll('a')
            for b in ulVariable:
                if ulVariable is not None:
                    if b is not None:
                        linkItemsList = list()
                        links = (b.get("href"))
                        linkItemsList.append(links)
                        for link in linkItemsList:
                            searchElements(link)
                            print('link internal data print')
                            writeToFile(ul)
                else:
                    print('link in not none else')
                    writeToFile(ul)
        print('all non link')
        writeToFile(ul)


def main():
    searchElements('https://www.thebalancecareers.com/list-of-information-technology-it-skills-2062410')


if __name__ == '__main__':
    main()

I do not have the right logic in place for the recursive calls, so I get stuck in the child pages.

So any help would be greatly appreciated.

1 answer:

Answer 0: (score: 1)

I think the main reason your code gets stuck is that some pages link back to pages you have already visited; this creates an infinite loop, and the recursive calls hang forever.

To avoid this, you need to keep track of the links you have already visited; you can do that with a list, as in the code below.
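The bookkeeping itself is independent of the scraping details. As a minimal sketch of the pattern (extract_links is a hypothetical placeholder for the real page parsing, and a set is used here only because membership tests are cheap; the list in the full code below works the same way):

def extract_links(url):
    # Hypothetical placeholder for the real BeautifulSoup parsing;
    # it would return the hrefs found on the page at `url`.
    return []


def crawl(url, visited=None):
    # Remember every URL already entered so that a cycle of
    # cross-links cannot make the recursion hang forever.
    if visited is None:
        visited = set()
    if url in visited:
        return  # already traversed: stop this branch
    visited.add(url)
    for link in extract_links(url):
        crawl(link, visited)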

The code below gets to the end of the search, but to do so the following points need attention:

  1. Some of the external links the pages point to are broken; that is why I put in a try/except clause (otherwise the code would error out).

  2. Some of the text (at least one item, as far as I checked) contains a special character, '\u200b', which upsets the file write; that is why I changed open to codecs.open with an encoding so it can be handled (a standalone sketch of points 2 and 3 follows this list).

  3. At least one link redirects to https://web.archive.org/... (handled in the code below), so I use a regular expression to change it back to www.thebalancecareers.com/. If you do not intend to follow those links, you will have to modify the code.

  4. Finally, I commented out the last writeToFile(ul) because it wrote None to the file, which caused an error.
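To make points 2 and 3 concrete in isolation, here is a minimal sketch; the sample string, file name, and URL are made up for illustration. Note that in Python 3 the built-in open also accepts an encoding argument, so codecs.open is one option rather than a requirement:

import re

# Point 2: '\u200b' (a zero-width space) can break writes under some
# default encodings; writing with an explicit utf-8 encoding handles it.
with open('demo.txt', 'a+', encoding='utf-8') as f:
    f.write('Team\u200bwork,')

# Point 3: strip the web.archive.org wrapper so the crawler stays on
# the original site. The URL below is a made-up example.
link = 'https://web.archive.org/web/20180425012458/https:/www.thebalancecareers.com/some-page'
link = re.sub(r'https://web.archive.org/(.*)/https:', 'https:/', link)
print(link)  # https://www.thebalancecareers.com/some-page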

I hope this helps.

import urllib
import urllib.request
from bs4 import BeautifulSoup
import codecs
import re

def writeToFile(ul):
    for li in ul:
        # codecs.open with an encoding can handle some special characters
        # in your text, such as '\u200b'
        with codecs.open('file.txt', encoding='utf-8', mode='a+') as f:
            text = li.text
            f.write(text + ',')
            f.close()

def searchElements(url, visitedurls):
    visitedurls.append(url)
    print(url)
    # Uncomment the following line to see all links visited by searchElements
    # print(visitedurls)

    # Some external links referenced by www.thebalancecareers.com
    # don't exist anymore or are forbidden
    try:
        response = urllib.request.urlopen(url)
    except (urllib.error.HTTPError, urllib.error.URLError):
        return

    html = response.read()

    soup = BeautifulSoup(html, 'html.parser')

    divs = soup.findAll('div', id=lambda x: x and x.startswith('mntl-sc-block_1-0-'))
    for div in divs:
        ul = div.find("ul")
        if ul is not None:
            ulVariable = ul.findAll('a')
            for b in ulVariable:
                if ulVariable is not None:
                    if b is not None:
                        linkItemsList = list()
                        links = (b.get("href"))
                        linkItemsList.append(links)
                        for link in linkItemsList:
                            # Get rid of this kind of link:
                            # https://web.archive.org/web/20180425012458/https:/www.thebalancecareers.com/....
                            link = re.sub(r'https://web.archive.org/(.*)/https:', 'https:/', link)
                            if link in visitedurls:
                                print('%s already traversed' % link)
                                return
                            else:
                                searchElements(link, visitedurls)
                                print('link internal data print')
                                writeToFile(ul)
                else:
                    print('link in not none else')
                    writeToFile(ul)
        print('all non link')

        # Commented: this would try to write None
        # resulting in an error
        #writeToFile(ul)

def main():
    visitedurls = []
    searchElements('https://www.thebalancecareers.com/list-of-information-technology-it-skills-2062410', visitedurls)


if __name__ == '__main__':
    main()
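One extra caveat, not covered by the answer above: b.get("href") can also return a relative path (for example /some-article), which urllib.request.urlopen cannot open on its own. If that turns out to happen on this site, the link could be resolved against the page it came from before recursing. A minimal sketch, with a made-up path:

from urllib.parse import urljoin

def normalize(base_url, href):
    # Resolve a possibly-relative href against the page it came from;
    # absolute URLs pass through unchanged.
    return urljoin(base_url, href)

print(normalize('https://www.thebalancecareers.com/list-of-information-technology-it-skills-2062410',
                '/some-article'))  # https://www.thebalancecareers.com/some-article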