Question

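I'm learning how to use BeautifulSoup. I managed to parse the HTML, and now I want to extract a list of links from the page. The problem is that I'm only interested in some of the links, and the only approach I can think of is to take all the links that come after a certain word. Can I drop part of the soup before I start extracting? Thanks.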

This is what I have:

# import libraries
import urllib2
from bs4 import BeautifulSoup
import pandas as pd
import os
import re

# specify the url
quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'

# query the website and return the html to the variable page
page = urllib2.urlopen(quote_page)

# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'html.parser')
print(soup)

# transform to pandas dataframe
pages1 = soup.find_all('li', )
print(pages1)
pages2 = pd.DataFrame({
    "papers": pages1,
})
print(pages2)

I need to get the upper part of the links into pages2. The only way to tell the links I want apart from the rest is a word that appears in the HTML, namely "{{1 }}".

Edit: I just noticed that I can also separate them by the beginning of the link. I only want the ones that start with "/article/".

Answer 1

There are several ways to get all the hrefs that start with "/article/". One simple way is:

# import libraries
import urllib.request
from bs4 import BeautifulSoup
import os
import re
import ssl

# specify the url
quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'

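# NOTE: a bare SSLContext() does not verify certificates;
# ssl.create_default_context() is the safer choice when the site's certificate validates
gcontext = ssl.SSLContext()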

# query the website and return the html to the variable page
page = urllib.request.urlopen(quote_page, context=gcontext)

# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'html.parser')

#print(soup)

# Anchor tags whose href starts with "/article/" (the ^ anchors the match at the start)
anchor_tags = soup.find_all('a', href=re.compile(r"^/article/"))

for link in anchor_tags:
    print(link.get('href'))
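
As a small offline check of why the anchor matters, the sketch below runs an unanchored and an anchored pattern against a made-up HTML fragment. The sample links are invented for illustration and are not taken from the page in the question. Beautiful Soup applies a compiled pattern with search semantics, so without ^ the text "/article/" may appear anywhere in the href.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; these links are invented for illustration.
sample_html = """
<ul>
  <li><a href="/article/a.htm">Paper A</a></li>
  <li><a href="/paper/b.htm">Paper B</a></li>
  <li><a href="https://example.org/mirror/article/c.htm">Mirror</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Unanchored pattern: matches "/article/" anywhere inside the href.
anywhere = [a["href"] for a in soup.find_all("a", href=re.compile("/article/"))]

# Anchored pattern: only hrefs that begin with "/article/".
prefix_only = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/article/"))]

print(anywhere)
print(prefix_only)
```

On this fragment the unanchored pattern also picks up the mirror link, while the anchored one keeps only the relative "/article/" link.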

This answer may also help. Also, go through the quick start guide of BeautifulSoup; it has a good example.
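
If the href prefix were not enough to tell the links apart, the idea from the question (taking only the links that come after a marker word) could be sketched roughly as follows. The HTML fragment and the marker text "Second section" are hypothetical stand-ins, since the real marker word is not shown in the question. The sketch relies on find_all_next, which walks everything after a given element in document order.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page layout; headings and links are invented for illustration.
sample_html = """
<div>
  <b>First section</b>
  <ul><li><a href="/paper/wp1.htm">WP 1</a></li></ul>
  <b>Second section</b>
  <ul>
    <li><a href="/article/ja1.htm">JA 1</a></li>
    <li><a href="/article/ja2.htm">JA 2</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Find the first text node containing the (assumed) marker word.
marker = soup.find(string=re.compile("Second section"))

# Collect every <a> with an href that follows the marker in document order.
links_after = []
if marker is not None:
    links_after = [a["href"] for a in marker.find_all_next("a", href=True)]

print(links_after)
```

Nothing is actually removed from the soup here; the search simply starts at the marker, which is usually enough for this purpose.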

Answer 2

The same thing, using a CSS selector:

# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'lxml')
#print(BeautifulSoup.prettify(soup))

css_selector = 'a[href^="/article"]'
href_tag_list = soup.select(css_selector)
print("Href list size:", len(href_tag_list))  # check that something was found; add an if/else if needed

href_link_list = []  # urljoin will probably be needed at some point to build absolute URLs
for href_tag in href_tag_list:
    href_link_list.append(href_tag['href'])
    print("href:", href_tag['href'])

I used this reference web page, which was provided by another Stack Overflow user: Web Link

Note: you will have to remove "/article/" from the list.
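
The code comment above mentions that urljoin will probably be needed. A minimal sketch of that step, assuming the hrefs are site-relative and using a made-up HTML fragment in place of the downloaded page, might look like this (base_url is the page URL from the question; the sample links are invented):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Page URL from the question, used as the base for resolving relative links.
base_url = "https://econpapers.repec.org/RAS/pab7.htm"

# Hypothetical HTML fragment; these links are invented for illustration.
sample_html = """
<a href="/article/x.htm">X</a>
<a href="/paper/y.htm">Y</a>
<a href="/article/z.htm">Z</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select anchors whose href starts with "/article" and resolve each against base_url.
absolute_links = [
    urljoin(base_url, a["href"])
    for a in soup.select('a[href^="/article"]')
]

for url in absolute_links:
    print(url)
```

Because the hrefs start with a slash, urljoin replaces the whole path of base_url and keeps only its scheme and host.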
