Question

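This script generates a CSV that contains data from only one of the URLs fed into it. There should be 98 sets of results, but the for loop does not get past the first URL.

I have been working on this for 12+ hours today. What am I missing to get the correct results?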

import requests
import re
from bs4 import BeautifulSoup
import csv

#Read csv
csvfile = open("gyms4.csv")
csvfilelist = csvfile.read()

def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup    # N.B. use yield instead of return

print r.text

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text

        th = pages.find('b',text="Category")
        td = th.findNext()
        for link in td.findAll('a',href=True):
            match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
            if match:
                web_address = link.text

gyms = [name,address,phoneNum,email,web_address]
gyms.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "w") as file:
    writer = csv.writer(file)
    for row in gyms:
        writer.writerow([row])

Answer 1

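Your code has 3 for loops, and you do not say which one is causing the problem. I assume it is the one in the get_page_data() function.

The return statement makes the function leave the loop during the very first pass. That is why you never reach the second URL.

There are at least two possible solutions:

Append the parsed page for each URL to a list and return that list after the loop.
Move the processing code into the loop and append the parsed data to gyms inside the loop.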
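
The difference between returning inside the loop and collecting results first can be seen in a small offline sketch. The parse() helper and the example.com URLs are hypothetical stand-ins for the real download-and-parse step, so this is only an illustration of the control flow, not the asker's scraper:

```python
def parse(url):
    # Hypothetical stand-in for requests.get(...) plus BeautifulSoup parsing.
    return "parsed " + url.strip()

def get_first_page(urls):
    for url in urls:
        return parse(url)          # exits the function during the first iteration

def get_all_pages(urls):
    pages = []
    for url in urls:
        pages.append(parse(url))   # first solution: collect every result
    return pages                   # return only after the loop has finished

# Made-up URLs, each ending in a newline as lines read from a file would.
urls = ["http://example.com/gym-1\n", "http://example.com/gym-2\n"]
first = get_first_page(urls)
all_pages = get_all_pages(urls)
print(first)
print(all_pages)
```

With the list version, the caller can loop over all_pages and run the extraction code once per page.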

Answer 2

As Alex.S said, get_page_data() returns on the first iteration, so the subsequent URLs are never accessed. In addition, the code that extracts data from a page has to run for every page downloaded, so it needs to be inside a loop too. You could turn get_page_data() into a generator and then iterate over the pages like this:

def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup    # N.B. use yield instead of return

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        # etc. etc.

You can write the data to the CSV file as each page is downloaded and processed, or you can accumulate the data in a list and write it in one go with csv.writer.writerows().

Also, you should pass the URL list to get_page_data() rather than accessing it from a global variable.
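
As a rough sketch of the "accumulate, then write once" option, the snippet below collects one list per gym and hands the whole list to csv.writer.writerows(). The write_gyms() helper, the header names, and the sample rows are assumptions made up for illustration, and an in-memory buffer is used so the sketch runs without touching the network or the disk:

```python
import csv
import io

def write_gyms(rows, out_file):
    """Write a header plus one CSV row per gym; rows is a list of 5-item lists."""
    writer = csv.writer(out_file)
    writer.writerow(["name", "address", "phone", "email", "website"])
    writer.writerows(rows)   # one call writes every accumulated row

# Placeholder data standing in for the values scraped from each page.
gyms = [
    ["Gym A", "1 Main St", "555-0100", "a@example.com", "http://a.example.com"],
    ["Gym B", "2 High St", "555-0101", "b@example.com", "http://b.example.com"],
]

# Swap the buffer for open("xgyms.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
write_gyms(gyms, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In the real script, the loop over get_page_data() would append [name, address, phoneNum, email, web_address] to gyms for each page, so that every page contributes its own row.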

How to extract data from all URLs, not just the first URL
