遍历pandas数据框列以比较结果

时间:2020-06-15 12:31:06

标签: python pandas numpy dataframe

我有一个带有一列本地社区的熊猫数据框。我想做的是浏览本专栏文章,并将每个邻域相互比较,以期对数据进行序列化。当我在python shell中使用一小部分数据时,它可以正常工作:

n = pd.DataFrame({'neighborhood':['Dupont Circle', 'Adams Morgan', 'alexandria', 'west end/dupont circle', 'logan circle', 'alexandria, va', 'washington', 'adam morgan/kalorama', 'Washington DC', 'Kalorama']})
print(n)
#results
#            neighborhood
#0           Dupont Circle
#1            Adams Morgan
#2              alexandria
#3  west end/dupont circle
#4            logan circle
#5          alexandria, va
#6              washington
#7    adam morgan/kalorama
#8           Washington DC
#9                Kalorama
for i in range(len(n['neighborhood'])):
    for j in range(i + 1, len(n['neighborhood'])):
        ratio = fw.partial_ratio(n['neighborhood'][i].lower(),n['neighborhood'][j].lower())
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j]+' - '+str(ratio))
        if ratio>90:
            n['neighborhood'][j] = n['neighborhood'][i]
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j])
print(n)
#results
#   neighborhood
#0  Dupont Circle
#1   Adams Morgan
#2     alexandria
#3  Dupont Circle
#4   logan circle
#5     alexandria
#6     washington
#7   Adams Morgan
#8     washington
#9       Kalorama

这是我期望发生的事情。但是,当我针对从craigslist抓取的数据运行它来扩大范围时,会遇到关键错误。

#this is from my main data source
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)

for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
            print(i)
            print(j)
            ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
            if ratio>90:
                neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]

当我运行这段代码时,print(I) print(j)会按预期返回0和1,但随后出现键错误。

ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])

第871行,在获取项

result = self.index.get_value(self, key)

文件“ C:\ Users \ cards \ AppData \ Local \ Programs \ Python \ Python38-32 \ lib \ site-packages \ pandas \ core \ indexes \ base.py”, 第4405行,位于get_value中 返回self._engine.get_value(s,k,tz = getattr(series.dtype,“ tz”,无))文件“ pandas_libs \ index.pyx”,第80行,在 pandas._libs.index.IndexEngine.get_value文件 第90行中的“ pandas_libs \ index.pyx” pandas._libs.index.IndexEngine.get_value文件 第138行中的“ pandas_libs \ index.pyx” pandas._libs.index.IndexEngine.get_loc文件 “ pandas_libs \ hashtable_class_helper.pxi”,第998行,在 pandas._libs.hashtable.Int64HashTable.get_item文件 “ pandas_libs \ hashtable_class_helper.pxi”,第1005行,在 pandas._libs.hashtable.Int64HashTable.get_item

KeyError:0

我的理解是,这与列和键的查找有关。但是,为什么它对较小的数据集有效,但对较大的数据集无效?

完整的剪贴代码:

from bs4 import BeautifulSoup
import json
from requests import get
import numpy as np
import pandas as pd
import csv
from fuzzywuzzy import fuzz as fw

print('hello world')
#get the initial page for the listings, to get the total count
response = get('https://washingtondc.craigslist.org/search/hhh?query=rent&availabilityMode=0&sale_date=all+dates')
html_result = BeautifulSoup(response.text, 'html.parser')
results = html_result.find('div', class_='search-legend')
total = int(results.find('span',class_='totalcount').text)
pages = np.arange(0,total+1,120)

neighborhood = []
bedroom_count =[]
sqft = []
price = []
link = []
count = 0
for page in pages:

    response = get('https://washingtondc.craigslist.org/search/hhh?s='+str(page)+'query=rent&availabilityMode=0&sale_date=all+dates')
    html_result = BeautifulSoup(response.text, 'html.parser')

    posts = html_result.find_all('li', class_='result-row')
    for post in posts:
        if post.find('span',class_='result-hood') is not None:
            post_url = post.find('a',class_='result-title hdrlnk')
            post_link = post_url['href']
            link.append(post_link)
            post_neighborhood = post.find('span',class_='result-hood').text
            post_price = int(post.find('span',class_='result-price').text.strip().replace('$',''))
            neighborhood.append(post_neighborhood)
            price.append(post_price)
            if post.find('span',class_='housing') is not None:
                if 'ft2' in post.find('span',class_='housing').text.split()[0]:
                    post_bedroom = np.nan
                    post_footage = post.find('span',class_='housing').text.split()[0][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())>2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = post.find('span',class_='housing').text.split()[2][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())==2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = np.nan
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
            else:
                post_bedroom = np.nan
                post_footage = np.nan
                bedroom_count.append(post_bedroom)
                sqft.append(post_footage)
        count+=1

print(count)
#create results data frame
post_results = pd.DataFrame({'neighborhood':neighborhood,'footage':sqft,'bedroom':bedroom_count,'price':price,'link':link})
#clean up results
post_results.drop_duplicates(subset='link')
post_results['footage'] = post_results['footage'].replace(0,np.nan)
post_results['bedroom'] = post_results['bedroom'].replace(0,np.nan)
post_results['neighborhood'] = post_results['neighborhood'].str.strip().str.strip('(|)')
post_results['neighborhood'] = post_results['neighborhood'].str.lower()
post_results = post_results.dropna(subset=['footage','bedroom'],how='all')
post_results.to_csv("rent_clean.csv",index=False)

neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)

for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
            print(i)
            print(j)
            ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
            if ratio>90:
                neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]

neighborhood_results.to_csv('neighborhood_clean_a.csv',index=False)

1 个答案:

答案 0 :(得分:0)

mongodb为您完成这项工作。它提供了非常简单的函数来遍历行和列:

很容易忘记索引在代码中的工作方式,并且通过使用迭代器,您知道自己正在访问所有可能的项。