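I have a DataFrame with a column that contains both JSON objects (dicts) and strings. I want to get rid of the rows that do not contain JSON objects.
Here is what my DataFrame looks like: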
import pandas as pd
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"a":9,"b":10,"c":11}]})
print(df)
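How can I remove the rows that contain only strings, so that once they are gone I can apply the following to this column to convert the JSON objects into separate DataFrame columns: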
from pandas.io.json import json_normalize
df = json_normalize(df['A'])
print(df)
Answer 0 (score: 3)
I think I would prefer to use an isinstance check:
In [11]: df.loc[df.A.apply(lambda d: isinstance(d, dict))]
Out[11]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}
If you also want to include numbers, you can do:
In [12]: df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
Out[12]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}
Adjust the tuple of types to whichever ones you want to include...
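One caveat that may be worth checking here: values that come from a plain Python list stay Python `int`/`float` objects inside an object-dtype column, and those are not instances of `np.number`. Below is a minimal sketch, assuming you want to keep dicts plus any numeric value, that uses the standard-library `numbers.Number` instead. The sample frame `mixed` is made up for illustration, and note that `bool` also counts as a `Number`.

```python
import numbers

import numpy as np
import pandas as pd

# Hypothetical sample column: strings, a dict, a Python int and a Python float.
mixed = pd.DataFrame({"A": ["hello", {"a": 5}, 7, 2.5, "usa"]})

# np.number only matches NumPy scalar types, so the Python 7 and 2.5 are dropped.
np_mask = mixed["A"].apply(lambda v: isinstance(v, (dict, np.number)))

# numbers.Number matches Python ints and floats
# (NumPy scalar types are registered with it as well).
num_mask = mixed["A"].apply(lambda v: isinstance(v, (dict, numbers.Number)))

print(mixed.loc[np_mask].index.tolist())   # [1]
print(mixed.loc[num_mask].index.tolist())  # [1, 2, 3]
```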
As a last step, json_normalize takes a list of JSON objects. For whatever reason a Series is no good (it gives a KeyError), but you can turn it into a list and you're good to go:
In [21]: df1 = df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
In [22]: json_normalize(list(df1["A"]))
Out[22]:
     a    b    c    d     e     f
0  5.0  6.0  8.0  NaN   NaN   NaN
1  NaN  NaN  NaN  9.0  10.0  11.0
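Putting the two steps together as a standalone script, here is a rough sketch using the sample data from the question (where both dicts share the keys a, b, c). It assumes pandas 1.0 or later, where the function is exposed as `pd.json_normalize`; the names `is_dict` and `flat` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["hello", "world", {"a": 5, "b": 6, "c": 8},
                         "usa", "india", {"a": 9, "b": 10, "c": 11}]})

# Boolean mask: True only where the cell holds a dict.
is_dict = df["A"].apply(lambda v: isinstance(v, dict))

# Hand json_normalize a plain list of dicts rather than the Series itself.
# Assumption: pandas >= 1.0, where the top-level pd.json_normalize exists.
flat = pd.json_normalize(df.loc[is_dict, "A"].tolist())
print(flat)
#    a   b   c
# 0  5   6   8
# 1  9  10  11
```

Converting with `.tolist()` also discards the original index (2 and 5), so the result is numbered from 0.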
Answer 1 (score: 1)
df[df.applymap(np.isreal).sum(1).gt(0)]
Out[794]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}
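Answer 2 (score: 0)
If you want an ugly solution that also works... here is a function I wrote that finds the rows that contain only strings and returns the df minus those rows. (Since your df has only one column, you simply end up with a one-column DataFrame that holds nothing but dicts.) From there, you want to use
df = json_normalize(df['A'].values)
rather than df = json_normalize(df['A']).
For a single-column DataFrame...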
import pandas as pd
import numpy as np
try:
    from pandas import json_normalize          # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas

def delete_strings(df):
    nrows = df.shape[0]
    rows_to_keep = []
    for row in np.arange(nrows):
        if isinstance(df.iloc[row, 0], dict):
            # add the row number to the list of rows to keep
            # if the row contains a dict
            rows_to_keep.append(row)
    # return only the rows with dicts, still as a one-column DataFrame
    # (selecting column 0 as a scalar would return a Series, and df['A'] below would fail)
    return df.iloc[rows_to_keep, [0]]

df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",
                         {"a":9,"b":10,"c":11}]})
df = delete_strings(df)
df = json_normalize(df['A'].values)
print(df)
#   a   b   c
#0  5   6   8
#1  9  10  11
For a multi-column df (this also works for a single-column df):
def delete_rows_of_strings(df):
    rows = df.shape[0]  # number of rows in df
    cols = df.shape[1]  # number of columns in df
    rows_to_keep = []   # list to track rows to keep
    for row in np.arange(rows):  # for every row in the dataframe
        # num_string will count the number of strings in the row
        num_string = 0
        for col in np.arange(cols):  # for each column in the row...
            # if the value is a string, add one to num_string
            if isinstance(df.iloc[row, col], str):
                num_string += 1
        # if num_string, the number of strings in the row,
        # isn't equal to the number of columns in the row...
        if num_string != cols:  # ...add that row number to the list of rows to keep
            rows_to_keep.append(row)
    # return the df with rows containing at least one non-string
    return df.iloc[rows_to_keep, :]
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india"],
                   'B': ['hi',{"a":5,"b":6,"c":8},'sup','america','china']})
#                          A                         B
#0                     hello                        hi
#1                     world  {'a': 5, 'b': 6, 'c': 8}
#2  {'a': 5, 'b': 6, 'c': 8}                       sup
#3                       usa                   america
#4                     india                     china
print(delete_rows_of_strings(df))
#                          A                         B
#1                     world  {'a': 5, 'b': 6, 'c': 8}
#2  {'a': 5, 'b': 6, 'c': 8}                       sup
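The same row filter can probably be written without explicit loops. The following is only a sketch of one way to do it, not part of the answer above: it builds a boolean frame marking string cells and keeps every row that has at least one non-string cell. The names `is_str` and `kept` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["hello", "world", {"a": 5, "b": 6, "c": 8}, "usa", "india"],
    "B": ["hi", {"a": 5, "b": 6, "c": 8}, "sup", "america", "china"],
})

# True where a cell is a string; Series.map is applied column by column.
is_str = df.apply(lambda col: col.map(lambda v: isinstance(v, str)))

# Keep the rows in which at least one cell is not a string.
kept = df.loc[~is_str.all(axis=1)]
print(kept.index.tolist())  # [1, 2]
```

Using `Series.map` per column sidesteps the choice between `DataFrame.applymap` and its newer replacement `DataFrame.map`, which depends on the pandas version.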