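I have never used Python before, so please forgive my lack of knowledge. I am trying to scrape a XenForo forum for all of its threads. So far it works, except that it picks up a separate URL for every page of the same thread. I have posted some data below to show what I mean.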
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
What I really want to scrape is just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")

main_url = "http://example.com/forum/"
forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])
print('Fetched ' + str(len(forums)) + ' forums')

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')
Answer 0 (score: 1)
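This solution assumes that the part of the URL that must be removed to check for uniqueness is always "/page-#...". If that is not the case, this solution will not work.
Instead of storing your URLs in a list, you can use a set, which only keeps unique values. Before adding a URL to the set, remove the last occurrence of "page" and everything after it, provided it has the form "/page-#", where # is any number.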
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        # Find the last "/page-" segment, if any
        position = url.rfind('/page-')
        # Strip it only when a digit follows, keeping the trailing slash
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
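The duplicates shown in the question appear in the thread links rather than the forum links, so it may help to apply the same idea inside the thread loop. The following is only a minimal sketch under that assumption: `strip_page_suffix` and `group_threads` are hypothetical helper names (not from the original post), the stripping logic is copied from the code above, and the sample pairs are the URLs listed in the question. It keeps a set of thread URLs per forum instead of overwriting one dictionary entry on each iteration.

```python
from collections import defaultdict

def strip_page_suffix(url):
    """Hypothetical helper: drop a trailing "/page-<digit>..." segment."""
    position = url.rfind('/page-')
    if position > 0 and url[position + 6:position + 7].isdigit():
        return url[:position + 1]  # keep the trailing slash
    return url

def group_threads(pairs):
    """Hypothetical helper: map each forum URL to its unique thread URLs."""
    threads = defaultdict(set)
    for forum, thread_url in pairs:
        threads[forum].add(strip_page_suffix(thread_url))
    return threads

# Sample (forum, thread) pairs taken from the output shown in the question
sample = [
    ('forums/my-first-forum/', 'threads/my-gap-year-uni-story.13846/'),
    ('forums/my-first-forum/', 'threads/my-gap-year-uni-story.13846/page-9'),
    ('forums/my-first-forum/', 'threads/my-gap-year-uni-story.13846/page-10'),
    ('forums/my-first-forum/', 'threads/my-gap-year-uni-story.13846/page-11'),
]

threads = group_threads(sample)
total = sum(len(urls) for urls in threads.values())
print('Fetched ' + str(total) + ' threads')
```

In the real script, the pairs would come from the nested loop over `forums` and `threadLinks`, passing `threadLink.attrs['href']` as the thread URL.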