Question

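I want to extract some information from a website whose URLs have the following format: http://www.pedigreequery.com/american+pharoah, where "american+pharoah" is the part that changes for each of many horse names. I have a list of the horse names I am searching for; I just need to figure out how to insert each name after "http://www.pedigreequery.com/".

This is what I have so far: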

import csv
allhorses = csv.reader(open('HORSES.csv') )
rows=list(allhorses)

import requests 
from bs4 import BeautifulSoup
for i in rows:      # Number of pages plus one 
    url = "http://www.pedigreequery.com/".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    letters = soup.find_all("a", class_="horseName")
    print(letters)

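When I print the URL, it does not have the horse's name at the end, just the URL in the quotes. The letters/print statement at the end is only there to check that it actually reaches the site. This is how I have seen it done for looping over URLs that change by a number at the end; I have not found any advice on URLs that change by characters.

Thanks!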

Answer 1

You are missing the placeholder in your format string, so change the format call to:

url = "http://www.pedigreequery.com/{}".format(i)
                                     ^
                                   #add placeholder
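
To see why the original URL never changed, here is a small check. The horse name is just the example value from the question. str.format ignores its arguments when the template contains no {} field, so no error is raised and the string comes back unchanged.

```python
base = "http://www.pedigreequery.com/"
name = "american+pharoah"  # example value, taken from the URL in the question

# No {} in the template: the argument is silently ignored.
without_placeholder = base.format(name)

# {} marks where the name is substituted.
with_placeholder = (base + "{}").format(name)

print(without_placeholder)
print(with_placeholder)
```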

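Also, rows = list(allhorses) gives you at best a list of lists, so you would be passing a list rather than a string (the horse name). If there is one horse per line, just open the file normally and iterate over the file object, stripping the newline from each line.

Assuming one horse name per line, the whole working code would be: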
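
A quick way to see the list-of-lists problem, as a sketch that uses an in-memory file in place of HORSES.csv (the two names are made-up sample data, not the contents of the original file):

```python
import csv
import io

# Stand-in for HORSES.csv; the contents are sample data.
sample = io.StringIO("american pharoah\nsecretariat\n")

rows = list(csv.reader(sample))
print(rows)  # each row is itself a list of fields

# Formatting a whole row inserts the list's repr, not the horse name.
bad_url = "http://www.pedigreequery.com/{}".format(rows[0])

# Indexing into the row gives the string that was intended.
good_url = "http://www.pedigreequery.com/{}".format(rows[0][0])

print(bad_url)
print(good_url)
```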

import requests
from bs4 import BeautifulSoup

with open("HORSES.csv") as f:
    # One horse name per line; strip the trailing newline from each.
    for horse in map(str.strip, f):
        url = "http://www.pedigreequery.com/{}".format(horse)
        r = requests.get(url)
        # Name the parser explicitly to avoid bs4's "no parser specified" warning.
        soup = BeautifulSoup(r.content, "html.parser")
        letters = soup.find_all("a", class_="horseName")
        print(letters)

If you have multiple horses per line, you can use the csv lib, but you will need an inner loop:

import csv

import requests
from bs4 import BeautifulSoup

with open("HORSES.csv") as f:
    for row in csv.reader(f):
        # Each row is a list of names, so loop over its fields.
        for horse in row:
            url = "http://www.pedigreequery.com/{}".format(horse)
            r = requests.get(url)
            soup = BeautifulSoup(r.content, "html.parser")
            letters = soup.find_all("a", class_="horseName")
            print(letters)

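Lastly, if the names are not stored in the right form, you have a few options; the simplest is to split each name and build the query manually: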

  url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))
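
As a sketch of that last step, the split-and-join can be wrapped in a small helper. The names build_url and build_url_quoted are hypothetical, not part of the answer. The second variant uses the standard library's urllib.parse.quote_plus, which also percent-encodes other unsafe characters; whether the site expects that encoding is an assumption.

```python
from urllib.parse import quote_plus

BASE_URL = "http://www.pedigreequery.com/"


def build_url(name):
    """Hypothetical helper: collapse whitespace, then join the words with '+'."""
    return BASE_URL + "+".join(name.split())


def build_url_quoted(name):
    """Variant using quote_plus, which turns spaces into '+' and
    percent-encodes other unsafe characters (assumed acceptable to the site)."""
    return BASE_URL + quote_plus(" ".join(name.split()))


print(build_url("  american   pharoah "))
print(build_url_quoted("american pharoah"))
```

Splitting on whitespace first means stray leading, trailing, or repeated spaces in the input file do not end up as extra '+' signs in the URL.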
