Question

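I am using BeautifulSoup to extract href links from web pages, then appending each page URL together with its extracted links to a list, building a list of lists. For example, for each URL I want to visit the page, extract the URLs from its links, and append them to an inner list to produce: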

[['www.example.com', 'www.example.com/extractedlink1', 'www.example.com/extractedlink2'],['www.apple.com', 'www.apple.com/extractedlink1']...]

The part I am having trouble with is appending both kinds of element to an inner list. Below, url_list is a list containing the URLs to scrape, e.g. ['www.example.com', 'www.apple.com'....]

url_and_extracted = []

for i in range(0,len(url_list)):
    url = url_list[i]
    driver = webdriver.PhantomJS()
    driver.get(url)
    time.sleep(2)
    html = driver.page_source
    driver.close()
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", attrs={"class" : "article-content entry-content"}):
        url_and_extracted.append([url_list[i],str(div.find("a")['href'])])

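But I don't think the last part is right: when more than one link is extracted from a URL, it produces several lists that all start with the same original URL. What I want is a single inner list per URL, containing the original URL followed by all of its extracted hrefs.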

Answer 1

Use a dict to map each URL to its extracted links:

{'www.example.com': ['www.example.com/extractedlink1', 'www.example.com/extractedlink2']}
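If the flat [url, link] pairs have already been collected the way the question's loop does, they can also be grouped into this shape afterwards. A minimal sketch, assuming a hypothetical `pairs` list that stands in for the scraped results (the names `pairs` and `links_by_url` are illustrative, not from the original code):

```python
from collections import defaultdict

# Hypothetical stand-in for the [url, link] pairs the question's loop produces.
pairs = [
    ["www.example.com", "www.example.com/extractedlink1"],
    ["www.example.com", "www.example.com/extractedlink2"],
    ["www.apple.com", "www.apple.com/extractedlink1"],
]

# Group every link under the URL it was extracted from.
links_by_url = defaultdict(list)
for url, link in pairs:
    links_by_url[url].append(link)

# Convert back to the list-of-lists shape asked for in the question.
url_and_extracted = [[url] + links for url, links in links_by_url.items()]
```

Because dicts keep insertion order in Python 3.7+, the rows come out in the order each URL was first seen.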

The reason your code doesn't work is this part:

for div in soup.find_all("div", attrs={"class" : "article-content entry-content"}):
    url_and_extracted.append([url_list[i],str(div.find("a")['href'])])

You should append to url_and_extracted once per iteration of i (that is, once per URL), not once per div.
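The difference between the two append positions can be seen without a browser. A small sketch, assuming a hypothetical `found` dict that plays the role of the hrefs already pulled out of each page (all names here are illustrative):

```python
# Hypothetical data: hrefs already extracted from each page's matching divs.
found = {
    "www.example.com": [
        "www.example.com/extractedlink1",
        "www.example.com/extractedlink2",
    ],
    "www.apple.com": ["www.apple.com/extractedlink1"],
}

# Appending inside the inner loop: one [url, href] list per link,
# so the same URL shows up in several lists.
per_div = []
for url, hrefs in found.items():
    for href in hrefs:
        per_div.append([url, href])

# Appending in the outer loop: one list per URL, holding all of its links.
per_url = []
for url, hrefs in found.items():
    row = [url]
    for href in hrefs:
        row.append(href)
    per_url.append(row)
```

Here `per_div` ends up with three two-element lists, while `per_url` has one list per URL, which is the shape the question asks for.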

Code:

import time
from collections import defaultdict

from bs4 import BeautifulSoup
from selenium import webdriver

url_and_extracted = []
# for i in range(0,len(url_list)):
for url in url_list:
    d = defaultdict(list)  # maps this url to the links found on its page
    driver = webdriver.PhantomJS()
    driver.get(url)
    time.sleep(2)
    html = driver.page_source
    driver.close()
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", attrs={"class" : "article-content entry-content"}):
        d[url].append(div.find("a")['href'])
    url_and_extracted.append(d)  # appended once per url, not once per div

Answer 2

A simple way is to collect the links first, then put the URL in front of them before adding the result to the main list:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url_and_extracted = []
driver = webdriver.PhantomJS()

for url in url_list:
    links = []  # collect the links here
    driver.get(url)
    time.sleep(2)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", attrs={"class" : "article-content entry-content"}):
        links.append(div.find("a")['href'])

    url_and_extracted.append([url] + links)  # add the url with [url] + links
                                             # to the main list.

driver.quit()  # shut the browser down once all pages are done
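The list-building step can also be separated from the fetching step, which makes it possible to check the `[url] + links` logic without starting a browser. A rough sketch, assuming a hypothetical helper `build_rows` and a fake extractor in place of the Selenium and BeautifulSoup work (neither name comes from the original answer):

```python
def build_rows(url_list, extract_links):
    """Return one [url, link1, link2, ...] list per URL.

    extract_links is any callable that takes a URL and returns
    an iterable of hrefs found on that page.
    """
    rows = []
    for url in url_list:
        links = list(extract_links(url))
        rows.append([url] + links)  # the URL first, then its links
    return rows


# Hypothetical extractor standing in for the real page fetch and parse.
fake_pages = {
    "www.example.com": [
        "www.example.com/extractedlink1",
        "www.example.com/extractedlink2",
    ],
    "www.apple.com": ["www.apple.com/extractedlink1"],
}


def fake_extract(url):
    return fake_pages.get(url, [])


rows = build_rows(
    ["www.example.com", "www.apple.com", "www.empty.com"], fake_extract
)
```

Note that a URL with no matching links still gets a row containing only the URL itself, which is also how the loop above behaves.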

How to append two different elements to a list within a list
