I have a pandas dataframe with a column of local neighborhoods. What I want to do is go through this column and compare each neighborhood to every other one, in an effort to standardize the data. When I run this against a small subset of the data in the python shell, it works fine:
n = pd.DataFrame({'neighborhood':['Dupont Circle', 'Adams Morgan', 'alexandria', 'west end/dupont circle', 'logan circle', 'alexandria, va', 'washington', 'adam morgan/kalorama', 'Washington DC', 'Kalorama']})
print(n)
#results
# neighborhood
#0 Dupont Circle
#1 Adams Morgan
#2 alexandria
#3 west end/dupont circle
#4 logan circle
#5 alexandria, va
#6 washington
#7 adam morgan/kalorama
#8 Washington DC
#9 Kalorama
for i in range(len(n['neighborhood'])):
    for j in range(i + 1, len(n['neighborhood'])):
        ratio = fw.partial_ratio(n['neighborhood'][i].lower(), n['neighborhood'][j].lower())
        print(n['neighborhood'][i] + ' : ' + n['neighborhood'][j] + ' - ' + str(ratio))
        if ratio > 90:
            n['neighborhood'][j] = n['neighborhood'][i]
            print(n['neighborhood'][i] + ' : ' + n['neighborhood'][j])
print(n)
#results
# neighborhood
#0 Dupont Circle
#1 Adams Morgan
#2 alexandria
#3 Dupont Circle
#4 logan circle
#5 alexandria
#6 washington
#7 Adams Morgan
#8 washington
#9 Kalorama
This is what I expected to happen. However, when I scale it up by running it against the data I scraped from craigslist, I get a key error.
#this is from my main data source
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)
for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i], neighborhood_results['neighborhood'][j])
        if ratio > 90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]
When I run this code, print(i) and print(j) return 0 and 1 as expected, but then I get a key error:
ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
  line 871, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\cards\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
My understanding is that this has something to do with how the columns and keys are looked up. But why does it work on the smaller data set and not on the larger one?
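To make that hunch concrete, here is a toy sketch of what I suspect is happening (made-up data, not my real frame): once rows have been dropped, a lookup like df['neighborhood'][0] means "the row whose index label is 0", and that label may simply no longer exist.

import pandas as pd

# hypothetical stand-in for neighborhood_results
df = pd.DataFrame({'neighborhood': [None, 'dupont circle', 'kalorama']})
df = df.dropna(subset=['neighborhood'])  # index is now [1, 2] -- no label 0
print(df['neighborhood'][1])             # fine, label 1 exists
print(df['neighborhood'][0])             # KeyError: 0, same error as above

In my real script the dropna happens on post_results before neighborhood_results is copied from it, so the copy would inherit any gaps in the index.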
Full scraping code:
from bs4 import BeautifulSoup
import json
from requests import get
import numpy as np
import pandas as pd
import csv
from fuzzywuzzy import fuzz as fw
print('hello world')
#get the initial page for the listings, to get the total count
response = get('https://washingtondc.craigslist.org/search/hhh?query=rent&availabilityMode=0&sale_date=all+dates')
html_result = BeautifulSoup(response.text, 'html.parser')
results = html_result.find('div', class_='search-legend')
total = int(results.find('span',class_='totalcount').text)
pages = np.arange(0,total+1,120)
neighborhood = []
bedroom_count =[]
sqft = []
price = []
link = []
count = 0
for page in pages:
    response = get('https://washingtondc.craigslist.org/search/hhh?s='+str(page)+'query=rent&availabilityMode=0&sale_date=all+dates')
    html_result = BeautifulSoup(response.text, 'html.parser')
    posts = html_result.find_all('li', class_='result-row')
    for post in posts:
        #only keep posts that list a neighborhood
        if post.find('span',class_='result-hood') is not None:
            post_url = post.find('a',class_='result-title hdrlnk')
            post_link = post_url['href']
            link.append(post_link)
            post_neighborhood = post.find('span',class_='result-hood').text
            post_price = int(post.find('span',class_='result-price').text.strip().replace('$',''))
            neighborhood.append(post_neighborhood)
            price.append(post_price)
            if post.find('span',class_='housing') is not None:
                #first token is square footage: no bedroom count given
                if 'ft2' in post.find('span',class_='housing').text.split()[0]:
                    post_bedroom = np.nan
                    post_footage = post.find('span',class_='housing').text.split()[0][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                #both a bedroom count and square footage given
                elif len(post.find('span',class_='housing').text.split())>2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = post.find('span',class_='housing').text.split()[2][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                #only a bedroom count given
                elif len(post.find('span',class_='housing').text.split())==2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = np.nan
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
            #no housing span at all: pad both lists so they stay aligned
            else:
                post_bedroom = np.nan
                post_footage = np.nan
                bedroom_count.append(post_bedroom)
                sqft.append(post_footage)
    count+=1
    print(count)
#create results data frame
post_results = pd.DataFrame({'neighborhood':neighborhood,'footage':sqft,'bedroom':bedroom_count,'price':price,'link':link})
#clean up results
post_results.drop_duplicates(subset='link')
post_results['footage'] = post_results['footage'].replace(0,np.nan)
post_results['bedroom'] = post_results['bedroom'].replace(0,np.nan)
post_results['neighborhood'] = post_results['neighborhood'].str.strip().str.strip('(|)')
post_results['neighborhood'] = post_results['neighborhood'].str.lower()
post_results = post_results.dropna(subset=['footage','bedroom'],how='all')
post_results.to_csv("rent_clean.csv",index=False)
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)
for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
        if ratio>90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]
neighborhood_results.to_csv('neighborhood_clean_a.csv',index=False)
Answer 0 (score: 0)
Let mongodb do this work for you. It provides really easy functions for iterating through rows and columns.

It is easy to lose track of how the indexes work in your code, and by using an iterator you know that you are visiting every item that actually exists.
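Since the data in the question is already in a pandas frame, here is the same iterator idea sketched in pandas terms (my sketch, not a specific mongodb recipe): walk the labels the index actually contains instead of assuming they run 0..n-1.

from fuzzywuzzy import fuzz as fw

# Sketch only: use the real index labels, so a label that was
# removed by dropna() can never be looked up.
labels = list(neighborhood_results.index)
for a, i in enumerate(labels):
    for j in labels[a + 1:]:
        left = neighborhood_results.at[i, 'neighborhood']
        right = neighborhood_results.at[j, 'neighborhood']
        if fw.partial_ratio(left, right) > 90:
            neighborhood_results.at[j, 'neighborhood'] = left

Alternatively, calling neighborhood_results = neighborhood_results.reset_index(drop=True) once before the loop makes the positional lookups in the question valid again, because the labels become 0..n-1 with no gaps.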