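I scraped a website with the code below.
The site is structured in a way that requires 4 different classes to scrape all of the data, which causes some of the data to be duplicated.
To convert my variables into lists, I tried the split('') method, but it only created a separate list for each scraped string, each beginning with \n. I also tried creating the variables as empty lists, e.g. api_name = [], but that did not work.
To remove the duplicates, I considered using set, but I think it only works on lists.
I want to remove all duplicate data from my variables before writing them to the CSV file. Do I have to convert them to lists first, or can I remove the duplicates directly from the variables?
Any help or feedback on the code would be appreciated.
Thanks.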
import requests
from bs4 import BeautifulSoup
import csv
url = "https://www.programmableweb.com/apis/directory"
api_no = 0
urlnumber = 0
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
csv_file = open('api_scrapper.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['api_no', 'API Name', 'Description','api_url', 'Category', 'Submitted'])
# This is where I parse and combine all the classes, which causes the duplicate data
directories1 = soup.find_all('tr', {'class': 'odd'})
directories2 = soup.find_all('tr', {'class': 'even'})
directories3 = soup.find_all('tr', {'class': 'odd views-row-first'})
directories4 = soup.find_all('tr', {'class': 'odd views-row-last'})
directories = directories1 + directories2 + directories3 + directories4
while urlnumber <= 765:
    for directory in directories:
        api_NameTag = directory.find('td', {'class':'views-field views-field-title col-md-3'})
        api_name = api_NameTag.text if api_NameTag else "N/A"
        description_nametag = directory.find('td', {'class': 'col-md-8'})
        description = description_nametag.text if description_nametag else 'N/A'
        api_url = 'https://www.programmableweb.com' + api_NameTag.a.get('href')
        category_nametage = directory.find('td',{'class': 'views-field views-field-field-article-primary-category'})
        category = category_nametage.text if category_nametage else 'N/A'
        submitted_nametag = directory.find('td', {'class':'views-field views-field-created'})
        submitted = submitted_nametag.text if submitted_nametag else 'N/A'
        #These are the variables I want to remove the duplicates from
        csv_writer.writerow([api_no,api_name,description,api_url,category,submitted])
        api_no +=1
    urlnumber +=1
    url = "https://www.programmableweb.com/apis/directory?page=" + str(urlnumber)
csv_file.close()
Answer 0 (score: 1)
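If it weren't for the API links, I would say just use pandas read_html and take index 2. Since you want the URLs as well, I suggest changing your selectors. You want to restrict the search to the table to avoid duplicates, and select on the class name that describes each column.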
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.programmableweb.com/apis/directory')
soup = bs(r.content, 'lxml')
api_names, api_links = zip(*[(item.text, 'https://www.programmableweb.com' + item['href']) for item in soup.select('.table .views-field-title a')])
descriptions = [item.text for item in soup.select('td.views-field-search-api-excerpt')]
categories = [item.text for item in soup.select('td.views-field-field-article-primary-category a')]
submitted = [item.text for item in soup.select('td.views-field-created')]
df = pd.DataFrame(list(zip(api_names, api_links, descriptions, categories, submitted)), columns = ['API name','API Link', 'Description', 'Category', 'Submitted'])
print(df)
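To illustrate why combining the four find_all calls from the question produces duplicates, here is a minimal sketch on a hand-written table. The HTML is a hypothetical miniature, not the real page markup, and the /api/... paths are made up. Searching for the single class odd also matches rows that carry extra classes such as views-row-first, so those rows are picked up twice, while one selector restricted to the table returns each row once.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the directory table; the real markup may differ.
html = """
<table class="table">
  <tr class="odd views-row-first"><td class="views-field-title"><a href="/api/a">A</a></td></tr>
  <tr class="even"><td class="views-field-title"><a href="/api/b">B</a></td></tr>
  <tr class="odd views-row-last"><td class="views-field-title"><a href="/api/c">C</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches every row that has it among its classes,
# so the first and last rows are already included in odd_rows.
odd_rows = soup.find_all("tr", {"class": "odd"})
even_rows = soup.find_all("tr", {"class": "even"})
first_rows = soup.find_all("tr", {"class": "odd views-row-first"})
last_rows = soup.find_all("tr", {"class": "odd views-row-last"})
combined = odd_rows + even_rows + first_rows + last_rows

# One selector scoped to the table visits each row exactly once.
rows = soup.select(".table tr")

print(len(combined), len(rows))
```

On this sample the combined list is longer than the table itself, which is the duplication described in the question.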
Alternatively, you could do
pd.read_html(url)[2]
and then add an extra column for api_links from bs4, using the selectors shown above.
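On the question of whether set only works on lists: set accepts any iterable of hashable items, so one option is to collect each scraped row as a tuple and skip rows that have already been seen before writing. The sketch below assumes the rows have already been scraped; the sample values, the unique_rows helper, and the in-memory buffer standing in for the CSV file are all hypothetical.

```python
import csv
import io

# Hypothetical scraped rows; the second entry duplicates the first.
scraped = [
    ("API A", "First description", "https://www.programmableweb.com/api/a", "Mapping", "01.01.2019"),
    ("API A", "First description", "https://www.programmableweb.com/api/a", "Mapping", "01.01.2019"),
    ("API B", "Second description", "https://www.programmableweb.com/api/b", "Payments", "02.01.2019"),
]


def unique_rows(rows):
    """Yield each row once, keeping the order of first appearance."""
    seen = set()
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can be stored in a set
        if key not in seen:
            seen.add(key)
            yield row


# An in-memory buffer is used here instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["api_no", "API Name", "Description", "api_url", "Category", "Submitted"])
for api_no, row in enumerate(unique_rows(scraped)):
    writer.writerow([api_no, *row])

print(buffer.getvalue())
```

If the data is already in a DataFrame as in the answer above, df.drop_duplicates() is another way to get the same effect.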