pandas.DataFrame.drop_duplicates(inplace=True) raises "TypeError: unhashable type: 'dict'"

Date: 2018-08-01 00:49:17

Tags: python pandas dataframe

Here is my code:

Block 1

import requests
import pandas as pd

url = ('http://www.omdbapi.com/' '?apikey=ff21610b&t=social+network')
r = requests.get(url)
json_data = r.json()
# from app
print(json_data['Awards'])
json_dict = dict(json_data)
tab=""
# printing all data as Dictionary
print("JSON as Dictionary (all):\n")
for k,v in json_dict.items():
  if len(k) > 6:
    tab = "\t"
  else:
    tab = "\t\t"
  print(str(k) + ":" + tab + str(v))
df = pd.DataFrame(json_dict)
df.drop_duplicates(inplace=True)
# printing Pandas DataFrame of all data
print("JSON as DataFrame (all):\n{}".format(df))

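I was working through a sample exercise on DataCamp and then started exploring further. The exercise stops at print(json_data['Awards']). I went a step further and tried converting the JSON response to a dictionary and building a pandas DataFrame from it. Interestingly, this is my output: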

Won 3 Oscars. Another 165 wins & 168 nominations.
JSON as Dictionary (all):

Title:      The Social Network
Year:       2010
Rated:      PG-13
Released:   01 Oct 2010
Runtime:    120 min
Genre:      Biography, Drama
Director:   David Fincher
Writer:     Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:     Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:       Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:   English, French
Country:    USA
Awards:     Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:     https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:    [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating: 7.7
imdbVotes:  542,658
imdbID:     tt1285016
Type:       movie
DVD:        11 Jan 2011
BoxOffice:  $96,400,000
Production: Columbia Pictures
Website:    http://www.thesocialnetwork-movie.com/
Response:   True
Traceback (most recent call last):
  File "C:\Users\rschosta\OneDrive - Incitec Pivot Limited\Documents\Data Science\omdb-api-test.py", line 20, in <module>
    df.drop_duplicates(inplace=True)
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3535, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3582, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 3570, in f
    vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
  File "C:\Users\rschosta\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 471, in factorize
    labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1367, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'dict'

I did some research on .drop_duplicates(), because I have used it before and it worked fine. Here is a sample that works as expected:

Block 2

import pandas as pd
import numpy as np

#Create a DataFrame
d = {
    'Name':['Alisa','Bobby','jodha','jack','raghu','Cathrine',
            'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
    'Age':[26,24,23,22,23,24,26,24,22,23,24,24],

    'Score':[85,63,55,74,31,77,85,63,42,62,89,77]}

df = pd.DataFrame(d,columns=['Name','Age','Score'])
print(df)
df.drop_duplicates(keep=False, inplace=True)
print(df)

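Note that the two code blocks differ. I also tried importing numpy as np in the first script, and it did not change the result.

Any ideas on how to get the drop_duplicates() method to work on Block 1?

Output Block 1-A

As requested by @Wen, here is the data as a dictionary: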

{'Title': 'The Social Network', 'Year': '2010', 'Rated': 'PG-13', 'Released': '01 Oct 2010', 'Runtime': '120 min', 'Genre': 'Biography, Drama', 'Director': 'David Fincher', 'Writer': 'Aaron Sorkin (screenplay), Ben Mezrich (book)', 'Actors': 'Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons', 'Plot': 'Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.', 'Language': 'English, French', 'Country': 'USA', 'Awards': 'Won 3 Oscars. Another 165 wins & 168 nominations.', 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}], 'Metascore': '95', 'imdbRating': '7.7', 'imdbVotes': '542,658', 'imdbID': 'tt1285016', 'Type': 'movie', 'DVD': '11 Jan 2011', 'BoxOffice': '$96,400,000', 'Production': 'Columbia Pictures', 'Website': 'http://www.thesocialnetwork-movie.com/', 'Response': 'True'}

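For now, instead of calling .drop_duplicates() right away, I plan to convert the Ratings dictionaries into columns before dropping duplicates. I also have more output from the printed listing, which is easier to read: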

Title:      The Social Network
Year:       2010
Rated:      PG-13
Released:   01 Oct 2010
Runtime:    120 min
Genre:      Biography, Drama
Director:   David Fincher
Writer:     Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:     Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:       Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:   English, French
Country:    USA
Awards:     Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:     https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:    [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating: 7.7
imdbVotes:  542,658
imdbID:     tt1285016
Type:       movie
DVD:        11 Jan 2011
BoxOffice:  $96,400,000
Production: Columbia Pictures
Website:    http://www.thesocialnetwork-movie.com/
Response:   True
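The plan mentioned above, turning the Ratings dictionaries into columns before dropping duplicates, could be sketched roughly as follows. This assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize (older versions expose it as pandas.io.json.json_normalize). The record below is a trimmed copy of the dictionary shown earlier, and the names record and flat are only illustrative.

```python
import pandas as pd

# Trimmed copy of the API response shown above (only a few keys kept).
record = {
    "Title": "The Social Network",
    "Year": "2010",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "7.7/10"},
        {"Source": "Rotten Tomatoes", "Value": "96%"},
        {"Source": "Metacritic", "Value": "95/100"},
    ],
}

# One row per rating; the Source/Value keys become ordinary columns,
# and Title/Year are repeated on every row as metadata.
flat = pd.json_normalize(record, record_path="Ratings", meta=["Title", "Year"])

# Every cell is now a plain string, so the hash-based duplicate check works.
flat = flat.drop_duplicates()
print(flat)
```

Once no cell holds a dict anymore, drop_duplicates() has nothing unhashable to trip over.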

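2 Answers:

Answer 0: (score: 4)

You have a Ratings column filled with dictionaries. drop_duplicates works by hashing the values in each row, and dicts are mutable and therefore not hashable, so the call fails.

As a solution, you can transform these values into a frozenset of (key, value) tuples, which is hashable, and then use drop_duplicates: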

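# Replace each dict with a hashable frozenset of its (key, value) pairs
df['Ratings'] = df.Ratings.transform(lambda k: frozenset(k.items()))
# drop_duplicates() returns a new DataFrame, so keep the result
df = df.drop_duplicates()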
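A self-contained sketch of the same idea, for anyone who wants to reproduce it without calling the API. The frame df_demo and its contents are made up for illustration, and Series.map is used here in place of transform simply because it is unambiguously element-wise. It also assumes that every value inside the dicts is itself hashable.

```python
import pandas as pd

# Hypothetical frame: the first two rows are identical, the third differs.
df_demo = pd.DataFrame({
    "Title": ["The Social Network"] * 3,
    "Ratings": [
        {"Source": "Metacritic", "Value": "95/100"},
        {"Source": "Metacritic", "Value": "95/100"},
        {"Source": "Rotten Tomatoes", "Value": "96%"},
    ],
})

# With dict cells, the hash-based duplicate check raises TypeError.
try:
    df_demo.drop_duplicates()
    raised = False
except TypeError:
    raised = True

# Convert each dict to a frozenset of its (key, value) pairs, which is hashable.
hashable = df_demo.copy()
hashable["Ratings"] = hashable["Ratings"].map(lambda d: frozenset(d.items()))

deduped = hashable.drop_duplicates()
print(raised, len(deduped))
```

The trade-off is that the Ratings column now holds frozensets rather than dicts; dict(cell) turns a cell back into a dictionary if that is needed later.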

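Or just select the columns you want to use as a reference. For example, if you want to drop duplicates based only on Title and Year, you could do something like: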
ref_cols = ['Title', 'Year']
df.loc[~df[ref_cols].duplicated()]
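As a runnable sketch of the column-selection approach, the snippet below builds a small hypothetical frame (df_demo, with invented rows) and compares the boolean-mask form above with the equivalent subset= argument of drop_duplicates. In both cases only the listed columns are hashed, so the dicts in Ratings never reach the duplicate check.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share Title and Year but differ in Ratings.
df_demo = pd.DataFrame({
    "Title": ["The Social Network", "The Social Network", "Some Other Film"],
    "Year": ["2010", "2010", "2011"],
    "Ratings": [
        {"Source": "Metacritic", "Value": "95/100"},
        {"Source": "Rotten Tomatoes", "Value": "96%"},
        {"Source": "Metacritic", "Value": "50/100"},
    ],
})

ref_cols = ["Title", "Year"]

# Boolean-mask form, as in the answer above.
by_mask = df_demo.loc[~df_demo[ref_cols].duplicated()]

# Equivalent form using the subset parameter of drop_duplicates.
by_subset = df_demo.drop_duplicates(subset=ref_cols)

print(by_subset)
```

Note that this keeps the first row of each Title/Year group and silently discards the other ratings, which may or may not be what you want.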

Answer 1: (score: 1)

Object columns that hold dict or list values usually cause these problems. One way around it is to convert them to str.
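A minimal sketch of that idea, under the assumption that comparing the string form of each cell is good enough. The frame df_demo is invented for illustration. The string copy is used only to compute the duplicate mask, so the original dicts stay in the result.

```python
import pandas as pd

# Hypothetical frame: the first two rows are identical, the third differs.
df_demo = pd.DataFrame({
    "Title": ["The Social Network"] * 3,
    "Ratings": [
        {"Source": "Metacritic", "Value": "95/100"},
        {"Source": "Metacritic", "Value": "95/100"},
        {"Source": "Rotten Tomatoes", "Value": "96%"},
    ],
})

# Cast every cell to its string representation just for the comparison.
keep_mask = ~df_demo.astype(str).duplicated()

# Apply the mask to the original frame so Ratings still contains dicts.
deduped = df_demo.loc[keep_mask]
print(deduped)
```

One caveat: two dicts with the same content but a different key order produce different strings, so they would not be treated as duplicates here.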