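Filter one DataFrame by unique values from another DataFrame

Asked: 2019-04-26 19:02:36

Tags: python python-3.x dataframe filtering grouping

I have 2 pandas DataFrames:

The first DataFrame holds all of the imported data, i.e. the initial review data, including "prodcode", "sentiment", "summaryText", "reviewText", and so on.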

DFF = DFF[['prodcode', 'summaryText', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful','reviewTime', 'unixReviewTime', 'sentiment','textLength']]

Which produces:


     prodcode                                 summaryText                                         reviewText  overall      reviewerID    ...       helpful   reviewTime unixReviewTime  sentiment textLength
0  B00002243X  Work Well - Should Have Bought Longer Ones  I needed a set of jumper cables for my new car...      5.0  A3F73SC1LY51OO    ...        [4, 4]  08 17, 2011     1313539200          2        516
1  B00002243X                            Okay long cables  These long cables work fine for my truck, but ...      4.0  A20S66SKYXULG2    ...        [1, 1]   09 4, 2011     1315094400          2        265
2  B00002243X                  Looks and feels heavy Duty  Can't comment much on these since they have no...      5.0  A2I8LFSN2IS5EO    ...        [0, 0]  07 25, 2013     1374710400          2       1142
3  B00002243X       Excellent choice for Jumper Cables!!!  I absolutley love Amazon!!!  For the price of ...      5.0  A3GT2EWQSO45ZG    ...      [19, 19]  12 21, 2010     1292889600          2       4739
4  B00002243X      Excellent, High Quality Starter Cables  I purchased the 12' feet long cable set and th...      5.0  A3ESWJPAVRPWB4    ...        [0, 0]   07 4, 2012     1341360000          2        415

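The second DataFrame groups the data by product code and, for each product, gives the share of its reviews that received each sentiment score (the number of reviews with that score divided by the number of all reviews for that product).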

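# Count reviews per (prodcode, sentiment) next to the count per prodcode
df1 = (
    DFF.groupby(["prodcode", "sentiment"]).count()
    .join(DFF.groupby("prodcode").count(), "prodcode", rsuffix="_r"))[['reviewText', 'reviewText_r']]

# Share of each sentiment among all reviews of the product
df1['result'] = df1['reviewText']/df1['reviewText_r']
df1 = df1.reset_index()
# Keyword arguments: pandas 2.0 and later no longer accept these positionally
df1 = df1.pivot(index="prodcode", columns='sentiment', values='result').fillna(0)
df1 = round(df1 * 100)
df1.astype('int')  # the result is not assigned, so df1 keeps its float values

sorted_df2 = df1.sort_values(['0', '1', '2'], ascending=False)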

Which produces the following DataFrame:

sentiment      0     1     2
prodcode                        
B0024E6QOO  80.0   0.0  20.0
B000GPV2QA  67.0  17.0  17.0
B0067DNSUI  67.0   0.0  33.0
B00192JH4S  62.0  12.0  25.0
B0087FSA0C  60.0  20.0  20.0
B0002KM5L0  60.0   0.0  40.0
B000DZBP60  60.0   0.0  40.0
B000PJCBOE  60.0   0.0  40.0
B0033A5PPO  57.0  29.0  14.0
B003POL69C  57.0  14.0  29.0
B0002Z9L8K  56.0  31.0  12.0
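
As a side note, the same kind of percentage table can be sketched with pd.crosstab. This is only an illustration under assumptions: the toy frame below is invented, sentiment is assumed to be stored as the strings '0', '1', '2' (as the column labels above suggest), and crosstab counts rows rather than non-null reviewText values, so results could differ if reviewText has missing entries.

```python
import pandas as pd

# Toy stand-in for DFF; product codes and values are invented for illustration.
toy = pd.DataFrame({
    "prodcode":   ["A", "A", "A", "A", "B", "B"],
    "sentiment":  ["0", "0", "0", "2", "1", "2"],
    "reviewText": ["r1", "r2", "r3", "r4", "r5", "r6"],
})

# normalize="index" divides each row by its row total,
# which gives the per-product share of each sentiment.
ratios = (
    pd.crosstab(toy["prodcode"], toy["sentiment"], normalize="index")
    .mul(100)
    .round()
    .astype(int)
)
print(ratios)
```

Sentiments that never occur for a product come out as 0 without a separate fillna step.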

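I am now trying to filter my first DataFrame in two steps. First, by the results of the second DataFrame: I want to keep only the prodcodes for which df1['0'] > 40 in the second DataFrame. Second, from that list, I want to keep only the rows of the first DataFrame where 'sentiment' is 0.

From a high level, I am trying to obtain the prodcode, summaryText and reviewText from the first DataFrame for products that have a high ratio of low sentiment scores, restricted to the reviews whose sentiment is 0.

2 Answers:

Answer 0: (score: 0)

Something like this:

This assumes all the data you need is in df1 and that no merge is required.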

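# product codes that have at least one review with sentiment 0
m = list(DFF['prodcode'].loc[DFF['sentiment'] == 0])
# keep rows whose '0' share exceeds 40 and whose prodcode is in that list
# (df is assumed to hold both a 'prodcode' column and the '0' column)
df.loc[(df['0'] > 40) & (df['prodcode'].isin(m))]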
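
A self-contained sketch of the isin idea, without a merge, might look like the following. The frames DFF_toy and df1_toy are hypothetical stand-ins with invented values, the percentage table is written by hand to match the toy reviews, and sentiment is assumed to be stored as strings.

```python
import pandas as pd

# Invented review rows standing in for DFF.
DFF_toy = pd.DataFrame({
    "prodcode":    ["A", "A", "A", "B", "B", "C"],
    "sentiment":   ["0", "2", "2", "0", "1", "0"],
    "summaryText": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "reviewText":  ["t1", "t2", "t3", "t4", "t5", "t6"],
})

# Hand-written percentage table indexed by prodcode, shaped like the pivoted df1.
df1_toy = pd.DataFrame(
    {"0": [33.0, 50.0, 100.0], "1": [0.0, 50.0, 0.0], "2": [67.0, 0.0, 0.0]},
    index=pd.Index(["A", "B", "C"], name="prodcode"),
)

# Step 1: product codes whose share of sentiment '0' exceeds 40.
codes = df1_toy.index[df1_toy["0"] > 40]

# Step 2: rows of the review frame for those products with sentiment '0'.
mask = DFF_toy["prodcode"].isin(codes) & (DFF_toy["sentiment"] == "0")
result = DFF_toy.loc[mask, ["prodcode", "summaryText", "reviewText"]]
print(result)
```

Because the product codes are read from the index of the percentage table, the review frame never needs the percentage columns attached to it.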

Answer 1: (score: 0)

I figured it out:

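# attach each product's sentiment percentages to its review rows
DF3 = pd.merge(DFF, df1, on='prodcode')
# mostly-negative products, restricted to the reviews whose sentiment is '0'
mask = (DF3['0'] > 50.0) & (DF3['2'] < 50.0) & (DF3['sentiment'].isin(['0']))
print(DF3.loc[mask].sort_values('0', ascending=False))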
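
To see the merge-then-filter approach end to end, here is a small sketch on invented data. The names reviews and shares are hypothetical, the percentage values are written by hand, the threshold of 40 is taken from the question, and sentiment is assumed to be stored as strings.

```python
import pandas as pd

# Invented review rows standing in for DFF.
reviews = pd.DataFrame({
    "prodcode":    ["A", "A", "A", "B", "B", "C"],
    "sentiment":   ["0", "2", "2", "0", "1", "0"],
    "summaryText": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "reviewText":  ["t1", "t2", "t3", "t4", "t5", "t6"],
})

# Hand-written percentage table indexed by prodcode, shaped like the pivoted df1.
shares = pd.DataFrame(
    {"0": [33.0, 50.0, 100.0], "1": [0.0, 50.0, 0.0], "2": [67.0, 0.0, 0.0]},
    index=pd.Index(["A", "B", "C"], name="prodcode"),
)

# reset_index turns prodcode into an ordinary column so the join key is explicit.
merged = reviews.merge(shares.reset_index(), on="prodcode", how="left")

# Keep sentiment-'0' reviews of products whose '0' share exceeds 40,
# most negative products first.
hits = (
    merged.loc[
        (merged["0"] > 40) & (merged["sentiment"] == "0"),
        ["prodcode", "summaryText", "reviewText", "0"],
    ]
    .sort_values("0", ascending=False)
)
print(hits)
```

Compared with an isin filter, the merge keeps the percentage columns on every review row, which makes it possible to sort or add further conditions on them afterwards.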