I have a pandas dataframe with a column of local neighborhoods. What I want to do is go through this column and compare each neighborhood to every other one, in an effort to standardize the data. When I run this against a small subset of the data in the python shell, it works fine:
n = pd.DataFrame({'neighborhood':['Dupont Circle', 'Adams Morgan', 'alexandria', 'west end/dupont circle', 'logan circle', 'alexandria, va', 'washington', 'adam morgan/kalorama', 'Washington DC', 'Kalorama']})
print(n)
#results
# neighborhood
#0 Dupont Circle
#1 Adams Morgan
#2 alexandria
#3 west end/dupont circle
#4 logan circle
#5 alexandria, va
#6 washington
#7 adam morgan/kalorama
#8 Washington DC
#9 Kalorama
for i in range(len(n['neighborhood'])):
    for j in range(i + 1, len(n['neighborhood'])):
        ratio = fw.partial_ratio(n['neighborhood'][i].lower(), n['neighborhood'][j].lower())
        print(n['neighborhood'][i] + ' : ' + n['neighborhood'][j] + ' - ' + str(ratio))
        if ratio > 90:
            n['neighborhood'][j] = n['neighborhood'][i]
            print(n['neighborhood'][i] + ' : ' + n['neighborhood'][j])
print(n)
#results
# neighborhood
#0 Dupont Circle
#1 Adams Morgan
#2 alexandria
#3 Dupont Circle
#4 logan circle
#5 alexandria
#6 washington
#7 Adams Morgan
#8 washington
#9 Kalorama
This is what I expected to happen. However, when I scale it up by running it against the data I scraped from craigslist, I get a key error.
#this is from my main data source
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)
for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i], neighborhood_results['neighborhood'][j])
        if ratio > 90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]
When I run this code, print(i) and print(j) return 0 and 1 as expected, but then I get a key error:
ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
  line 871, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\cards\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
My understanding is that this has something to do with how the columns and keys are looked up. But why does it work on the smaller data set and not on the larger one?
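To make that hunch concrete, here is a toy sketch of what I suspect is happening (made-up data, not my real frame): once rows have been dropped, a lookup like df['neighborhood'][0] means "the row whose index label is 0", and that label may simply no longer exist.

import pandas as pd

# hypothetical stand-in for neighborhood_results
df = pd.DataFrame({'neighborhood': [None, 'dupont circle', 'kalorama']})
df = df.dropna(subset=['neighborhood'])  # index is now [1, 2] -- no label 0
print(df['neighborhood'][1])             # fine, label 1 exists
print(df['neighborhood'][0])             # KeyError: 0, same error as above

In my real script the dropna happens on post_results before neighborhood_results is copied from it, so the copy would inherit any gaps in the index.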
Full scraping code:
from bs4 import BeautifulSoup
import json
from requests import get
import numpy as np
import pandas as pd
import csv
from fuzzywuzzy import fuzz as fw
print('hello world')
#get the initial page for the listings, to get the total count
response = get('https://washingtondc.craigslist.org/search/hhh?query=rent&availabilityMode=0&sale_date=all+dates')
html_result = BeautifulSoup(response.text, 'html.parser')
results = html_result.find('div', class_='search-legend')
total = int(results.find('span',class_='totalcount').text)
pages = np.arange(0,total+1,120)
neighborhood = []
bedroom_count =[]
sqft = []
price = []
link = []
count = 0
for page in pages:
    response = get('https://washingtondc.craigslist.org/search/hhh?s='+str(page)+'query=rent&availabilityMode=0&sale_date=all+dates')
    html_result = BeautifulSoup(response.text, 'html.parser')
    posts = html_result.find_all('li', class_='result-row')
    for post in posts:
        #only keep posts that list a neighborhood
        if post.find('span',class_='result-hood') is not None:
            post_url = post.find('a',class_='result-title hdrlnk')
            post_link = post_url['href']
            link.append(post_link)
            post_neighborhood = post.find('span',class_='result-hood').text
            post_price = int(post.find('span',class_='result-price').text.strip().replace('$',''))
            neighborhood.append(post_neighborhood)
            price.append(post_price)
            if post.find('span',class_='housing') is not None:
                #first token is square footage: no bedroom count given
                if 'ft2' in post.find('span',class_='housing').text.split()[0]:
                    post_bedroom = np.nan
                    post_footage = post.find('span',class_='housing').text.split()[0][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                #both a bedroom count and square footage given
                elif len(post.find('span',class_='housing').text.split())>2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = post.find('span',class_='housing').text.split()[2][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                #only a bedroom count given
                elif len(post.find('span',class_='housing').text.split())==2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = np.nan
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
            #no housing span at all: pad both lists so they stay aligned
            else:
                post_bedroom = np.nan
                post_footage = np.nan
                bedroom_count.append(post_bedroom)
                sqft.append(post_footage)
    count+=1
    print(count)
#create results data frame
post_results = pd.DataFrame({'neighborhood':neighborhood,'footage':sqft,'bedroom':bedroom_count,'price':price,'link':link})
#clean up results
post_results.drop_duplicates(subset='link')
post_results['footage'] = post_results['footage'].replace(0,np.nan)
post_results['bedroom'] = post_results['bedroom'].replace(0,np.nan)
post_results['neighborhood'] = post_results['neighborhood'].str.strip().str.strip('(|)')
post_results['neighborhood'] = post_results['neighborhood'].str.lower()
post_results = post_results.dropna(subset=['footage','bedroom'],how='all')
post_results.to_csv("rent_clean.csv",index=False)
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)
for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
        if ratio>90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]
neighborhood_results.to_csv('neighborhood_clean_a.csv',index=False)
Answer 0 (score: 0)
Let mongodb do this work for you. It provides really easy functions for iterating through rows and columns.

It is easy to lose track of how the indexes work in your code, and by using an iterator you know that you are visiting every item that actually exists.
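Since the data in the question is already in a pandas frame, here is the same iterator idea sketched in pandas terms (my sketch, not a specific mongodb recipe): walk the labels the index actually contains instead of assuming they run 0..n-1.

from fuzzywuzzy import fuzz as fw

# Sketch only: use the real index labels, so a label that was
# removed by dropna() can never be looked up.
labels = list(neighborhood_results.index)
for a, i in enumerate(labels):
    for j in labels[a + 1:]:
        left = neighborhood_results.at[i, 'neighborhood']
        right = neighborhood_results.at[j, 'neighborhood']
        if fw.partial_ratio(left, right) > 90:
            neighborhood_results.at[j, 'neighborhood'] = left

Alternatively, calling neighborhood_results = neighborhood_results.reset_index(drop=True) once before the loop makes the positional lookups in the question valid again, because the labels become 0..n-1 with no gaps.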