Question

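I want to retrieve all links from a website that contain a specific phrase.

An example on a public website would be retrieving all videos from a large YouTube channel (for example, Linus Tech Tips):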

from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
current_link = ''
for link in soup.find_all('a'):
    current_link = link.get('href')
    print(current_link)

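I have three questions here:

1. How do I get only the hyperlinks that contain a phrase such as "watch?v="?
2. Most hyperlinks are not shown. In the browser, they appear as you scroll down. BeautifulSoup only finds the links that are available without scrolling. How can I retrieve all hyperlinks?
3. Every hyperlink appears twice. How can I select each hyperlink only once?

Any suggestions?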

Answer 1

How do I get only the hyperlinks that contain the phrase "watch?v="?

Wrap the print statement in an if statement:

# link.get('href') returns None for <a> tags without an href, so check that first
if current_link and 'watch?v=' in current_link:
    print(current_link)
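
As a minimal sketch of the same check, the filter can be pulled into a small helper that works on any list of href values. The function name `links_containing` and the sample hrefs are made up for illustration; they are not part of the original question.

```python
def links_containing(hrefs, phrase):
    """Return the hrefs that contain `phrase`, skipping missing (None) values."""
    return [h for h in hrefs if h and phrase in h]

# Hypothetical sample data standing in for the values of link.get('href')
sample_hrefs = ['/watch?v=abc123', None, '/channel/xyz', '/watch?v=def456']

video_links = links_containing(sample_hrefs, 'watch?v=')
print(video_links)
```

The `h and ...` guard matters because anchor tags without an href yield None, and `'watch?v=' in None` raises a TypeError.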

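Every hyperlink appears twice. How can I select each hyperlink only once?

Store each hyperlink as a key in a dictionary and set its value to an arbitrary number. A dictionary holds each key only once, so adding a duplicate just overwrites the existing entry.

Something like this: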

myLinks = {}    # declare a dictionary to hold your data

# ... inside the loop over the <a> tags:

if current_link and 'watch?v=' in current_link:
    print(current_link)
    myLinks[current_link] = 1

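You can iterate over the keys (links) in the dictionary like this:

for link in myLinks:    # iterating over a dict yields its keys
    print(link)

This prints all the links in the dictionary.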
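
If the dictionary values are never used, one possible shortcut is `dict.fromkeys`, which builds the dictionary in one step and keeps the first-seen order of the links. This is only a sketch; the `hrefs` list is hypothetical sample data.

```python
# Hypothetical sample data: the same links repeated, as on the channel page
hrefs = ['/watch?v=a', '/watch?v=b', '/watch?v=a', '/watch?v=c', '/watch?v=b']

# dict.fromkeys keeps one key per distinct link, in first-seen order
unique_links = list(dict.fromkeys(hrefs))

for link in unique_links:
    print(link)
```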

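Most hyperlinks are not shown. In the browser, they appear as you scroll down. BeautifulSoup only finds the links that are available without scrolling. How can I retrieve all hyperlinks?

I'm not sure how you would directly get around the scripting on the page you pointed us to, but you can always follow the links you got from the initial scrape, pull new links from the side panel of each page, and traverse those in turn. That should give you most, if not all, of the links you want.

To do this, you will want another dictionary that stores the links you have already traversed, so you can check whether a link has been visited. You can check for a key in a dictionary like this: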

if key in myDict:
    print('myDict has this key already!')
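
The traversal idea above can be sketched as a breadth-first crawl with a visited set. To keep the example runnable without network access, `FAKE_SITE` and `fetch_links` are hypothetical stand-ins; in a real crawler `fetch_links` would download the page and extract its hrefs. The `max_pages` limit is also an assumption, added so the crawl cannot run forever.

```python
from collections import deque

# Hypothetical stand-in for the real site: page URL -> links found on that page
FAKE_SITE = {
    '/videos': ['/watch?v=a', '/watch?v=b'],
    '/watch?v=a': ['/watch?v=b', '/watch?v=c'],
    '/watch?v=b': ['/watch?v=a'],
    '/watch?v=c': ['/watch?v=d', '/about'],
    '/watch?v=d': [],
}

def fetch_links(url):
    """Placeholder: a real crawler would download `url` and parse its hrefs."""
    return FAKE_SITE.get(url, [])

def crawl(start, phrase, max_pages=100):
    """Follow matching links breadth-first, visiting each page at most once."""
    visited = set()   # pages already traversed
    found = {}        # matching links, deduplicated, in discovery order
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for href in fetch_links(url):
            if href and phrase in href:
                found[href] = 1
                if href not in visited:
                    queue.append(href)
    return list(found)

all_videos = crawl('/videos', 'watch?v=')
print(all_videos)
```

A set is used for the visited pages instead of a second dictionary because only membership is needed; the `in` check works the same way for both.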

Answer 2

I would use the requests library.

For Python 3:

import requests

# Placeholder URL; requests needs the scheme (http:// or https://)
SearchString = "https://SampleURL.com"

response = requests.get(SearchString)
zeta = response.text    # decoded page source as a str
with open("File.txt", "w", encoding="utf-8") as l:
    l.write(zeta)    # the with block closes the file automatically

# And now open up the file with the information written to it

with open("File.txt", "r", encoding="utf-8") as x:
    jello = x.read()

# Take the text between the first '"salePrice":' and the next comma
t = jello.split('"salePrice":', 1)[1].split(",", 1)[0]

# You'll notice above that I used the keyword "salePrice". This should be a unique identifier in the page's XPath. Typically, pressing F12 in Chrome and navigating until the item is highlighted gives you the XPath if you right-click and copy.

# This only returns a single result; you'll want a for loop that iterates over File.txt until you have found all the separate results.
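
One way to collect every occurrence instead of only the first is a regular expression, which replaces the manual loop. This is a sketch rather than the answerer's method; `page_source` is made-up sample text, and the pattern assumes each value ends at the next comma or closing brace.

```python
import re

# Hypothetical sample standing in for the contents of File.txt
page_source = (
    '{"items":[{"name":"A","salePrice":19.99,"stock":3},'
    '{"name":"B","salePrice":5,"stock":0}]}'
)

# Capture everything after each "salePrice": up to the next comma or brace
prices = re.findall(r'"salePrice":\s*([^,}]+)', page_source)
print(prices)
```

The same pattern could be pointed at "watch?v=" style links by changing the keyword and the terminating characters.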

I hope this helps; let me know if you need more help.

Answer 3

Parts one and three:

Create a list and append the links to it:

from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
links = [] # see here
for link in soup.find_all('a'):
    links.append(link.get('href')) # and here

Then create a set and convert it back to a list to remove the duplicates:

links = list(set(links))

Now return the items of interest:

clean_links = [i for i in links if i and 'watch?v=' in i]  # skip None hrefs
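
The three steps can be tried end to end on a small inline page. To keep this sketch free of network access and third-party packages, it swaps BeautifulSoup for the standard library's `html.parser`; the class name `HrefCollector` and the `SAMPLE_HTML` markup are invented for illustration. Sorting the set gives a stable order, which a plain `list(set(...))` does not guarantee.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # ignore <a> tags without an href
                self.hrefs.append(href)

# Hypothetical page: each video is linked twice (thumbnail and title)
SAMPLE_HTML = '''
<a href="/watch?v=abc"><img src="t.jpg"></a>
<a href="/watch?v=abc">Title one</a>
<a href="/about">About</a>
<a>no href</a>
<a href="/watch?v=xyz">Title two</a>
'''

collector = HrefCollector()
collector.feed(SAMPLE_HTML)

# Deduplicate with a set, keep only video links, sort for a stable order
clean_links = sorted({h for h in collector.hrefs if 'watch?v=' in h})
print(clean_links)
```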

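Part two:

To navigate the site, you will probably need more than Beautiful Soup. Scrapy has a great API that lets you pull down pages and explore how to parse parent and child elements with XPath. I strongly recommend trying Scrapy and using its interactive shell to tune your extraction method.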
