Question

I am learning to scrape text from the web. I wrote the following function:

from bs4 import BeautifulSoup
import requests

def get_url(source_url):
    r  = requests.get(source_url)
    data = r.text
    #extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    #get H3 tags with class ...
    h3list = soup.findAll("h3", { "class" : "entry-title td-module-title" })
    #create data structure to store links in
    ulist = []
    #pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist

I call it from a separate file:

from print1 import get_url 

ulist = get_url("http://www.startupsmart.com.au/")

print(ulist[3])

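The problem is that the CSS selector I use is specific to the site I am parsing, so the function is a bit brittle. I would like to pass the CSS selector to the function as an argument.

If I add a parameter to the function definition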

def get_url(source_url, css_tag):

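and try to pass "h3", { "class" : "entry-title td-module-title" }

it fails with

TypeError: get_url() takes exactly 1 argument (2 given)

I tried escaping all the quotes, but it still does not work.

I would really appreciate some help. I have not been able to find an answer to this.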

Answer 1

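Here is a working version. The tag name and the attribute dictionary are two separate arguments, so the function takes a separate parameter for each: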

from bs4 import BeautifulSoup
import requests

def get_url(source_url, tag_name, attrs):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # find every tag that matches the given tag name and attributes
    h3list = soup.find_all(tag_name, attrs)
    # create data structure to store links in
    ulist = []
    # pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist

ulist = get_url("http://www.startupsmart.com.au/", "h3", {"class": "entry-title td-module-title"})

print(ulist[3])
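
Since the question asks about passing a CSS selector, here is a minimal sketch of an alternative that takes a single selector string and hands it to BeautifulSoup's select() method. The function name extract_links, the sample HTML, and the selector string are made up for illustration, and the function parses an HTML string instead of fetching a URL so that it can be run offline:

```python
from bs4 import BeautifulSoup


def extract_links(html, css_selector):
    """Return the href of every element matched by css_selector.

    Hypothetical helper for illustration; it is not part of the original code.
    """
    soup = BeautifulSoup(html, "html.parser")
    # select() accepts a CSS selector string and returns a list of matching tags
    return [tag["href"] for tag in soup.select(css_selector)]


# Made-up markup that imitates the structure described in the question
SAMPLE_HTML = """
<h3 class="entry-title td-module-title"><a href="/first">First</a></h3>
<h3 class="entry-title td-module-title"><a href="/second">Second</a></h3>
<h3 class="other"><a href="/ignored">Ignored</a></h3>
"""

# The whole selector travels as one string argument
links = extract_links(SAMPLE_HTML, "h3.entry-title.td-module-title a[href]")
print(links)
```

To use it against a live page, the text returned by requests.get(source_url).text could be passed as the html argument. Note that the two approaches may not match exactly the same elements: the CSS selector matches any h3 that carries both classes, while the attrs dictionary with a space-separated string compares against the full class attribute value.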
