Getting links from a website's homepage using Python

Asked: 2017-06-29 20:24:38

Tags: python beautifulsoup

I am trying to write a script that grabs the social media links (mainly Twitter/Facebook) from a site's homepage, and I am completely stuck since I am quite new to Python.

What I want to accomplish is to parse a website, find the social media links, and save them in a new data frame with columns for the original URL, the Twitter link, and the Facebook link. Here is the code I have so far for the New York Times site:

from bs4 import BeautifulSoup
import requests

url = "http://www.nytimes.com"
r = requests.get(url)
sm_sites = ['twitter.com','facebook.com']

soup = BeautifulSoup(r.content, 'html5lib')
all_links = soup.find_all('a', href = True)


for site in sm_sites:
    if all(site in sm_sites for link in all_links):
        print(site)
    else:
        print('no link')

I am having some trouble understanding what the loop is doing, and how to get it to do what I need. I also tried storing the site instead of print(site), but that did not work... so I thought I would ask for help. I went through a number of answers on here before asking, but none of them got me where I need to be.

2 Answers:

Answer 0 (score: 4)

The way this code works, you already have the links. Your homepage URL is the starting point (http://www.nytimes.com), you have the social media domains in sm_sites = ['twitter.com','facebook.com'], and all you need to do is confirm that they appear among the links on the main page. Note that your loop never actually inspects the links: the condition all(site in sm_sites for link in all_links) tests site in sm_sites, which is always True, once per link. If you want to save the confirmed social media URLs, append them to a list.
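For reference, a minimal fix of the question's loop keeping the same variables (a sketch; any() reports whether at least one href on the page contains the domain):

for site in sm_sites:
    # Test the actual hrefs instead of membership in sm_sites itself
    if any(site in link['href'] for link in all_links):
        print(site)
    else:
        print('no link')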

Here is one way to get the social media links off a page:

import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/python"
r = requests.get(url)
sm_sites = ['twitter.com', 'facebook.com']
sm_sites_present = []

soup = BeautifulSoup(r.content, 'html5lib')
all_links = soup.find_all('a', href=True)

# Collect every href that contains one of the social media domains
for sm_site in sm_sites:
    for link in all_links:
        if sm_site in link.attrs['href']:
            sm_sites_present.append(link.attrs['href'])

print(sm_sites_present)

Output:

['https://twitter.com/stackoverflow', 'https://www.facebook.com/officialstackoverflow/']

Update:
putting the URLs in a df

import requests
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import display

urls = [
    "https://stackoverflow.com/questions/tagged/python",
    "https://www.nytimes.com/",
    "https://en.wikipedia.org/"
]

sm_sites = ['twitter.com', 'facebook.com']
columns = ['url'] + sm_sites
df = pd.DataFrame(data={'url': urls}, columns=columns)

def get_sm(row):
    """Fetch one page and return a Series of the social media links found on it."""
    r = requests.get(row['url'])
    output = pd.Series(dtype=object)

    soup = BeautifulSoup(r.content, 'html5lib')
    all_links = soup.find_all('a', href=True)
    for sm_site in sm_sites:
        for link in all_links:
            if sm_site in link.attrs['href']:
                output[sm_site] = link.attrs['href']
    return output

# Fill the twitter.com / facebook.com columns, then mark pages with no match
sm_columns = df.apply(get_sm, axis=1)
df.update(sm_columns)
df = df.fillna(value='no link')
display(df)

Output: a DataFrame with a row for each url and the matched twitter.com / facebook.com links, showing 'no link' where nothing was found.
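If, as in the question, the result should be saved rather than just displayed, a minimal follow-up sketch (the filename is hypothetical):

# Write the finished frame to disk; 'social_links.csv' is just a placeholder name
df.to_csv('social_links.csv', index=False)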

Answer 1 (score: 0)

For adding this to a DataFrame, the following will do what you want. You can loop through a list of websites (urlsToSearch), adding a row to the dataframe for each one that contains the base website, all of the Facebook links, and all of the Twitter links.

from bs4 import BeautifulSoup
import requests
import pandas as pd

df = pd.DataFrame(columns=["Website", "Facebook", "Twitter"])

urlsToSearch = ["http://www.nytimes.com","http://www.businessinsider.com/"]

for url in urlsToSearch:
    r = requests.get(url) 

    tw_links = []
    fb_links = []

    soup = BeautifulSoup(r.text, 'html.parser')
    all_links = [link['href'] for link in soup.find_all('a', href = True)] #only get href

    for link in all_links:
        if "twitter.com" in link:
            tw_links.append(link)
        elif "facebook.com" in link:
            fb_links.append(link)

    df.loc[df.shape[0]] = [url, fb_links, tw_links]  # Add row to end of df
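Note that, unlike the first answer, each Facebook/Twitter cell here holds a list of every matching href (possibly empty). A quick way to inspect the result once the loop has run (a sketch):

# Each Facebook/Twitter cell is a Python list of matching hrefs
print(df)
print(df.loc[0, "Facebook"])  # all facebook.com links found on the first site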