I am learning how to scrape text from the web. I wrote the following function:
from bs4 import BeautifulSoup
import requests

def get_url(source_url):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # get H3 tags with class ...
    h3list = soup.findAll("h3", { "class" : "entry-title td-module-title" })
    # create data structure to store links in
    ulist = []
    # pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist
I call it from a separate file:
from print1 import get_url
ulist = get_url("http://www.startupsmart.com.au/")
print(ulist[3])
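The problem is that the CSS selector I am using is very specific to the site I am parsing, so the function is somewhat brittle. I would like to pass the CSS selector to the function as a parameter.
If I add a parameter to the function definition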
def get_url(source_url, css_tag):
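and try to pass "h3", { "class" : "entry-title td-module-title" }
I get
TypeError: get_url() takes exactly 1 argument (2 given)
I tried escaping all the quotes, but it still does not work.
I would really appreciate some help. I have not been able to find an answer to this.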
Answer 0 (score: 0)
Here is a working version:
from bs4 import BeautifulSoup
import requests

def get_url(source_url, tag_name, attrs):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # find every tag that matches the given tag name and attributes
    h3list = soup.find_all(tag_name, attrs)
    # create data structure to store links in
    ulist = []
    # pull links from each matching heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist

ulist = get_url("http://www.startupsmart.com.au/", "h3", {"class": "entry-title td-module-title"})
print(ulist[3])
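
If you would rather pass a single CSS selector string, which is what the question originally asked for, one option is BeautifulSoup's select() method. The following is only a minimal sketch: the function name extract_links and the SAMPLE_HTML markup are made up for illustration, and the function takes the HTML text directly so it can be tried without hitting the network.

```python
from bs4 import BeautifulSoup


def extract_links(html, selector):
    """Return the href of the first <a> inside each element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for element in soup.select(selector):
        anchor = element.find("a", href=True)
        if anchor is not None:  # skip matches that contain no link
            links.append(anchor["href"])
    return links


# Hypothetical sample markup, standing in for r.text from requests.get(...)
SAMPLE_HTML = """
<h3 class="entry-title td-module-title"><a href="/first">First</a></h3>
<h3 class="entry-title td-module-title"><a href="/second">Second</a></h3>
<h3 class="entry-title td-module-title">No link here</h3>
<h3 class="other"><a href="/ignored">Ignored</a></h3>
"""

links = extract_links(SAMPLE_HTML, "h3.entry-title.td-module-title")
print(links)
```

With a real page you would pass requests.get(source_url).text as the html argument. Checking for a missing anchor also avoids the error that href.a['href'] raises when a matching heading has no link inside it.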