Extracting links from websites

Date: 2021-02-05 20:57:34

Tags: python pandas beautifulsoup

I need to extract the links from a website, for example https://stackoverflow.com/questions/ask (just as an example).

I have tried using urlparse to extract the URL information, and then BeautifulSoup:

from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

domain_name = urlparse(url).netloc
soup = BeautifulSoup(requests.get(url).content, "html.parser")
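For reference, BeautifulSoup can extract `href` values from markup directly, so the parsing step can be checked without a network call. A minimal sketch with an inline HTML snippet (the markup here is just an illustration):

```python
from bs4 import BeautifulSoup

# parse a small HTML fragment instead of a downloaded page
html = '<a href="/a">A</a><p>no link</p><a href="/b">B</a>'
soup = BeautifulSoup(html, "html.parser")

# collect the href of every <a> tag
links = [a.get('href') for a in soup.find_all('a')]
```

Here `links` ends up as `['/a', '/b']`; the `<p>` tag is ignored because only `<a>` tags are selected.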

I need to save all the links of each website in a list. I would like something like this:

URL                                            Links
    https://stackoverflow.com/questions/ask    ['link1','link2','link3',...]
    https://anotherwebsite.com/sport           ['link1','link2','link3','link4']
    https://last_example.es                    []

Could you explain how to get a result like that?

1 Answer:

Answer 0 (score: 2):

Let's try:


import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_all_links(url):
    # of course one needs to deal with the case when `requests` fails
    # but that's outside the scope here
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # collect the href of every <a> tag on the page
    return [a.attrs.get('href', '') for a in soup.find_all('a')]

# sample data
df = pd.DataFrame({'URL':['https://stackoverflow.com/questions/ask']})


df['Links'] = df['URL'].apply(get_all_links)
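Note that the `href` values collected this way are often relative (e.g. `/questions/tagged/python`). If absolute URLs are needed, each one can be resolved against the page it came from with `urllib.parse.urljoin`. A minimal sketch (the `base` URL and the sample hrefs below are just illustrations):

```python
from urllib.parse import urljoin

# resolve possibly-relative hrefs against the page they came from
base = 'https://stackoverflow.com/questions/ask'
hrefs = ['/questions', 'https://example.com/x', 'tagged/python']

# relative paths are resolved against the base URL;
# already-absolute links are left unchanged
absolute = [urljoin(base, h) for h in hrefs]
```

This step could be folded into `get_all_links` by applying `urljoin` inside the list comprehension, since the page URL is already available there.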