Question

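I am trying to build a Python crawler with BeautifulSoup, but I get an error saying that I am trying to write a non-string or other character buffer type to a file. From examining the program output, I found that my list contains many items that are None. Besides the None items, my list also contains a lot of image links and other things that are not page links. How can I add only the URLs to my list?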

    import urllib
    from BeautifulSoup import *

    try:
        with open('url_file', 'r') as f:
            url_list = [line.rstrip('\n') for line in f]
            f.close()
        with open('old_file', 'r') as x:
            old_list = [line.rstrip('\n') for line in f]
            f.close()
    except:
        url_list = list()
        old_list = list()
        #for Testing
        url_list.append("http://www.dinamalar.com/")


    count = 0


    for item in url_list:
        try:
            count = count + 1
            if count > 5:
                break

            html = urllib.urlopen(item).read()
            soup = BeautifulSoup(html)
            tags = soup('a')

            for tag in tags:

                if tag in old_list:
                    continue
                else:
                    url_list.append(tag.get('href', None))


            old_list.append(item)
            #for testing
            print url_list
        except:
            continue

    with open('url_file', 'w') as f:
        for s in url_list:
            f.write(s)
            f.write('\n')


    with open('old_file', 'w') as f:
        for s in old_list:
            f.write(s)

Answer 1

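First, use bs4 rather than BeautifulSoup3, which is no longer maintained. Your error occurs because not all anchors have an href, so you end up trying to write None to the file. Use find_all and set href=True so that you only find anchor tags that have an href attribute: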

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a", href=True)
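
To make the cause of the None items concrete, here is a minimal sketch run against a made-up HTML snippet. The markup and the variable names are assumptions for illustration and are not taken from the site in the question. It is written for Python 3 with bs4 installed (the question's code is Python 2) and uses the standard-library html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = '<a href="http://example.com/a">A</a><a name="top">B</a><a href="/c">C</a>'
soup = BeautifulSoup(html, "html.parser")

# tag.get("href") returns None when the attribute is missing,
# which is how None ends up in the URL list.
all_hrefs = [tag.get("href") for tag in soup.find_all("a")]

# href=True keeps only the anchors that actually carry an href attribute.
only_hrefs = [tag["href"] for tag in soup.find_all("a", href=True)]

print(all_hrefs)
print(only_hrefs)
```

Writing any of the None entries from the first list with f.write() is what triggers the error described in the question.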

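Also, never use a blanket except statement. Always catch the errors you expect, and at the very least print them when they occur. As for "I also have a lot of images and things that are not links": if you want to filter for certain links, you have to be more specific. Either look for the tags that contain what you want, if possible, or use a regular expression with href=re.compile("some_pattern"), or use a CSS selector: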

    # hrefs starting with something
    "a[href^=something]"

    # hrefs that contain something
    "a[href*=something]"

    # hrefs ending with something
    "a[href$=something]"
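
The filters above could be applied roughly as follows. This is only a sketch under assumptions: the HTML snippet, the patterns, and the variable names are invented for illustration, and it is written for Python 3 with bs4 (soup.select relies on the soupsieve package that recent bs4 releases depend on). The selector values are quoted because values such as ".jpg" are not plain CSS identifiers:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup mixing page links and an image link.
html = """
<a href="http://example.com/news/1">story</a>
<a href="http://example.com/img/photo.jpg">photo</a>
<a href="/about">about</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Regular expression: keep anchors whose href contains "/news/".
by_regex = [a["href"] for a in soup.find_all("a", href=re.compile(r"/news/"))]

# CSS selectors: href starting with, containing, and ending with a value.
starts_http = [a["href"] for a in soup.select('a[href^="http"]')]
has_about = [a["href"] for a in soup.select('a[href*="about"]')]
ends_jpg = [a["href"] for a in soup.select('a[href$=".jpg"]')]

print(by_regex)
print(starts_http)
print(has_about)
print(ends_jpg)
```

Which of these fits best depends on what the links you want have in common, for example a shared path prefix or a file extension you want to exclude.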

Only you know the structure of the HTML and what you want from it, so which approach you use is entirely up to you.
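
Returning to the point about blanket except statements, one possible way to catch only the expected errors while fetching a page is sketched below. The fetch helper and its opener parameter are hypothetical names introduced here for illustration, and the code targets Python 3's urllib.request rather than the Python 2 urllib used in the question:

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with opener(url) as response:
            return response.read()
    except (URLError, ValueError) as exc:
        # URLError covers network and HTTP failures (HTTPError is a subclass);
        # ValueError is raised by urlopen for malformed URLs.
        print("skipping %s: %s" % (url, exc))
        return None
```

Because the error is printed rather than swallowed, a failure such as a malformed URL in the list shows up in the output instead of silently skipping the page. Any other exception type still propagates, which makes unexpected bugs visible.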

Python - Problem creating a list of URLs with BeautifulSoup
