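I'm learning how to use BeautifulSoup. I managed to parse the HTML, and now I want to extract a list of links from the page. The problem is that I'm only interested in some of the links, and the only approach I can think of is to take every link that appears after a certain word. Can I drop part of the soup before I start extracting? Thanks.
This is what I have: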
    # import libraries (Python 2; urllib2 became urllib.request in Python 3)
    import urllib2
    from bs4 import BeautifulSoup
    import pandas as pd
    import os
    import re
    # specify the url
    quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'
    # query the website and return the html to the variable page
    page = urllib2.urlopen(quote_page)
    # parse the html using beautiful soup and store in variable soup
    soup = BeautifulSoup(page, 'html.parser')
    print(soup)
    # transform to pandas dataframe
    pages1 = soup.find_all('li')
    print(pages1)
    pages2 = pd.DataFrame({
        "papers": pages1,
    })
    print(pages2)
I need to get the upper half of the links into pages2. The only way to tell the links I want apart from the rest is a word that appears in the HTML, namely "{{1 }}".
Edit: I just noticed that I can also separate them by the beginning of the link. I only want the ones that start with "/article/".
Answer 0 (score: 2)
There are several ways to get all the hrefs that start with "/article/". One simple way is:
    # import libraries
    import re
    import ssl
    import urllib.request
    from bs4 import BeautifulSoup
    # specify the url
    quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'
    # default SSL context (verifies the server certificate)
    gcontext = ssl.create_default_context()
    # query the website and return the html to the variable page
    page = urllib.request.urlopen(quote_page, context=gcontext)
    # parse the html using beautiful soup and store in variable soup
    soup = BeautifulSoup(page, 'html.parser')
    # print(soup)
    # anchor tags whose href starts with "/article/" (^ pins the match to the start)
    anchor_tags = soup.find_all('a', href=re.compile(r'^/article/'))
    for link in anchor_tags:
        print(link.get('href'))
This answer may also help. It is also worth going through the quick start guide of BeautifulSoup, which has a good example.
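If the links could not be told apart by their href, the idea from the question (keep only what comes after a certain word) could be sketched with `find_all_next`. The HTML snippet and the heading text "Journal Articles" below are made up for illustration and are not taken from the real page; the assumption is that the marker word sits in an `<h2>`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page
html = """
<h2>Working Papers</h2>
<ul>
  <li><a href="/paper/aaa.htm">Paper A</a></li>
</ul>
<h2>Journal Articles</h2>
<ul>
  <li><a href="/article/bbb.htm">Article B</a></li>
  <li><a href="/article/ccc.htm">Article C</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading that holds the marker word (assumed to be an <h2>)
marker = soup.find("h2", string="Journal Articles")

# Collect every <a> that appears after the marker in document order
links_after = [a["href"] for a in marker.find_all_next("a", href=True)]
print(links_after)
```

Nothing is removed from the soup here; the search simply starts at the marker, so the links before it are never visited.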
Answer 1 (score: 2)
You can also use a CSS selector:
    # parse the html using beautiful soup and store in variable soup
    # (page is the urlopen response from the code above; 'lxml' needs the lxml package)
    soup = BeautifulSoup(page, 'lxml')
    # print(soup.prettify())
    css_selector = 'a[href^="/article"]'  # href attribute starts with "/article"
    href_tag_list = soup.select(css_selector)
    print("Href list size:", len(href_tag_list))  # check that something was found; add an if/else if needed
    href_link_list = []  # urljoin will probably be needed at some point
    for href_tag in href_tag_list:
        href_link_list.append(href_tag['href'])
        print("href:", href_tag['href'])
I used this reference web page, which was provided by another Stack Overflow user: Web Link
Note: you will have to remove the "/article/" entry from the list.
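As a rough sketch of the two remarks in this answer (turning the relative hrefs into full URLs with `urljoin`, and dropping a bare "/article/" entry), something like the following might work. The HTML snippet and its hrefs are made up for illustration; only the base URL comes from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://econpapers.repec.org/RAS/pab7.htm"  # page URL from the question

# Hypothetical HTML standing in for the downloaded page
html = """
<a href="/article/">All journals</a>
<a href="/article/abc/def.htm">An article</a>
<a href="/paper/xyz.htm">A working paper</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Select anchors whose href starts with "/article/"
hrefs = [a["href"] for a in soup.select('a[href^="/article/"]')]

# Drop the bare "/article/" entry, which is an index link rather than an article
hrefs = [h for h in hrefs if h != "/article/"]

# Resolve each relative href against the page URL
full_urls = [urljoin(base_url, h) for h in hrefs]
print(full_urls)
```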