How to grab the actual hyperlink and display it in the CSV file along with the other columns

Asked: 2019-04-23 17:36:43

Tags: python-3.x pandas web-scraping beautifulsoup

Hello Stack Overflow community,

I am trying to crawl a specific website (link below) and generate a CSV file with the data under predefined headers (see code). The link points to a page that has new data every day (the old data gets overwritten).

Problem:

I can't figure out how to grab the link from the href attribute so that it is printed for each line item. Please help me fix it and/or suggest a better solution. Any help is much appreciated.

Link: https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today

I tried following examples on the Internet and copying the approach used in the other lines of code (above the problematic one). Either no file is generated and/or the column does not contain the link associated with each line item.

import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

filename = "BuyandSell_V3.csv"

# Initialize an empty 'results' dataframe
results = pd.DataFrame()

# Iterate through the pages
for page in range(0,20):
    url = 'https://buyandsell.gc.ca/procurement-data/search/site?page=' + str(page) + '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    containers = page_soup.findAll("div",{"class":"rc"})

    # Get data from each container
    if containers != []:
        for each in containers:
            title = each.find('h2').text.strip()
            publication_date = each.find('dd', {'class':'data publication-date'}).text.strip()
            closing_date = each.find('dd', {'class':'data date-closing'}).text.strip()
            gsin = each.find('dd', {'class':'data gsin'}).text.strip()
            notice_type = each.find('dd', {'class':'data php'}).text.strip()
            procurement_entity = each.find('dd', {'class':'data procurement-entity'}).text.strip()
            link = each.find('a', {'href': 'data link'})

            # Create 1 row dataframe
            temp_df = pd.DataFrame([[title, publication_date, closing_date, gsin, notice_type, procurement_entity, link]], columns = ['Title', 'Publication Date', 'Closing Date', 'GSIN', 'Notice Type', 'Procurement Entity', 'Link'])

            # Append that row to a 'results' dataframe
            results = results.append(temp_df).reset_index(drop=True)
        print ('Acquired page ' + str(page+1))

    else:
        print ('No more pages')
        break


# If already have a file saved
if os.path.isfile(filename):

    # Read in previously saved file
    df = pd.read_csv(filename)

    # Append the newest results
    df = df.append(results).reset_index(drop=True)

    # Drop any duplicates (in case the newest results aren't really new)
    df = df.drop_duplicates()

    # Save the previous file, with appended results
    df.to_csv(filename, index=False)

else:

    # If a previous file not already saved, save a new one
    df = results.copy()
    df.to_csv(filename, index=False)
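Editorial note: the `DataFrame.append` calls above were deprecated in pandas 1.4 and removed in pandas 2.0. On newer pandas versions, the accumulate-then-deduplicate step can be written with `pd.concat` instead. A minimal sketch, using the same filename as the code above and a hypothetical one-row `results` frame standing in for scraped data:

```python
import os
import pandas as pd

filename = "BuyandSell_V3.csv"  # same filename as in the question

# Hypothetical one-row result, standing in for a scraped line item
results = pd.DataFrame([["Sample Notice", "2019-04-23"]],
                       columns=["Title", "Publication Date"])

if os.path.isfile(filename):
    # Read the previously saved file and append the newest results;
    # pd.concat replaces the removed DataFrame.append
    df = pd.read_csv(filename)
    df = pd.concat([df, results], ignore_index=True).drop_duplicates()
else:
    # No previous file: start from the new results
    df = results.copy()

df.to_csv(filename, index=False)
```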

1 Answer:

Answer 0 (score: 0)

Try replacing

link = each.find('a', {'href': 'data link'})

with

link = each.find('a')['href']

Reason: there is only one link (one `a` tag) in each block, and the `a` tag carries no identifying attribute, so you don't need any identifier to find it.

{'href': 'data link'}

This is also an identifier/filter, but in this case what you need is the value of the href attribute itself, so it does not work.
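A minimal sketch of the suggested fix, using a hypothetical HTML snippet that mimics one result container on the page (the `href` value is made up for illustration). Since the site's hrefs may be relative, joining them against the site root with `urljoin` yields a usable absolute URL:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical snippet mimicking one result container on the page
html = '''
<div class="rc">
  <h2><a href="/procurement-data/tender-notice/PW-123">Sample Notice</a></h2>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "rc"})

# find('a') returns the first <a> tag; ['href'] reads its attribute value
href = container.find("a")["href"]

# Join a relative href against the site root to get an absolute URL
link = urljoin("https://buyandsell.gc.ca", href)
print(link)  # https://buyandsell.gc.ca/procurement-data/tender-notice/PW-123
```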