Python pandas: group a DataFrame by a column (e.g. name) and get the values for each group

Date: 2016-08-31 10:38:49

Tags: python string pandas dataframe group-by

I have a DataFrame named df that looks like this:

  name   id    age             text 
   a      1     1    very good, and I like him
   b      2     2    I play basketball with his brother
   c      3     3    I hope to get a offer
   d      4     4    everything goes well, I think
   a      1     1    I will visit china
   b      2     2    no one can understand me, I will solve it
   c      3     3    I like followers
   d      4     4    maybe I will be good
   a      1     1    I should work hard to finish my research
   b      2     2    water is the source of earth, I agree it
   c      3     3    I hope you can keep in touch with me
   d      4     4    My baby is very cute, I like him

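As you can see, there are four names: a, b, c, d. Each name has an id, an age, and a text. Within each name group the id and age are always the same, but the text differs from row to row, and each name has three rows (this is just an example; the real data is much larger).

I want to get the id and age of each name group. I also want to use a function, extract_text(text), to compute the index of a character in every text of each group. In other words, I want data like the following, taking name 'a' as an example: age: 1, id: 1, and the index of 'I' in its three rows (made-up values for illustration, not the real ones): 20, 0, 0.

This is what I tried: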

 import pandas as pd

 def extract_text(text):
     index_n = None
     text_len = len(text)
     for i in range(0, text_len, 1):
         if text[i] == 'I':
             index_n = i
     return index_n



 df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd',     
                            'a', 'b', 'c', 'd'],
               'id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
               'age':[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
               'text':['very good, and I like him', 
                       'I play basketball with his brother',
                       'I hope to get a offer', 
                       'everything goes well, I think',
                       'I will visit china', 
                       'no one can understand me, I will solve it',
                       'I like followers', 'maybe I will be good',
                       'I should work hard to finish my research',                 
                       'water is the source of earth, I agree it',
                       'I hope you can keep in touch with me', 
                       'My baby is very cute, I like him']})


  id_num = df.groupby('name')['id'].value[0]
  id_num = df.groupby('age')['id'].value[0]
  index_num = df.groupby('age')['text'].apply(extract_text)

But I get an error:

  

  Traceback (most recent call last):
    File "bot_test_new.py", line 25, in <module>
      id_num = df.groupby('name')['id'].value[0]
  AttributeError: 'SeriesGroupBy' object has no attribute 'value'

Please help me, thanks!

2 Answers:

Answer 0: (score: 1)

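Let me elaborate on my comment. The problem is that extract_text can only handle a single string. When you group and then apply, however, the function receives the whole group at once: a Series containing all of the strings in that group.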
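To make that point concrete, here is a minimal sketch that records what apply passes to the function for each group. The four-row frame is a cut-down stand-in for the question's df, and received_types is a name made up for this illustration.

```python
import pandas as pd

# Cut-down stand-in for the question's df (two groups, two rows each).
df = pd.DataFrame({
    'age': [1, 2, 1, 2],
    'text': ['very good, and I like him',
             'I play basketball with his brother',
             'I will visit china',
             'no one can understand me, I will solve it'],
})

# For each group, return the type name of the object the function was given.
received_types = df.groupby('age')['text'].apply(
    lambda group: type(group).__name__
)
print(received_types)
```

Each group arrives as a pandas Series rather than a single str, which is why a function written for one string needs either a wrapper or a rewrite.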

There are two solutions. The first is the one I suggested (pass single strings to the function):

index_num = df.groupby('age')['text'].apply(lambda x: [extract_text(_) for _ in x]) 

The other is to change extract_text so that it can handle a list of strings:

 def extract_text(list_texts):
    list_index = []
    for text in list_texts:
        index_n = None
        text_len = len(text)
        for i in range(0, text_len, 1):
            if text[i] == 'I':
                index_n = i
        list_index.append(index_n)
    return list_index

Then call it as before:

index_num = df.groupby('age')['text'].apply(extract_text)

Also, inside extract_text you can use text.find("I") instead of your loop, like this:

 def extract_text(list_texts):
     return [text.find("I") for text in list_texts]
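One caveat that may be worth checking before swapping the loop for find: the loop in the question never breaks, so it keeps the last matching index and returns None when there is no match, whereas str.find returns the first match and -1 when there is none. The sketch below compares them on a made-up sample string; last_index_loop is a hypothetical name for the question's loop logic.

```python
def last_index_loop(text):
    # Same logic as the loop in the question: no break, so the last match wins.
    index_n = None
    for i in range(len(text)):
        if text[i] == 'I':
            index_n = i
    return index_n


sample = 'I think I can'             # made-up string with two occurrences of 'I'

first_pos = sample.find('I')         # index of the first 'I'
last_pos = sample.rfind('I')         # index of the last 'I'
loop_pos = last_index_loop(sample)   # what the original loop returns

no_match_find = 'no match'.find('I')           # -1 when nothing is found
no_match_loop = last_index_loop('no match')    # None when nothing is found

print(first_pos, last_pos, loop_pos, no_match_find, no_match_loop)
```

If the last occurrence is what you actually want, str.rfind is the closer replacement. For the example data both give the same answer, because each text contains 'I' only once.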

Answer 1: (score: 1)

I think you can use str.find:

print (df.groupby('age')['text'].apply(lambda x: x.str.find('I').tolist()))
age
1     [15, 0, 0]
2    [0, 26, 30]
3      [0, 0, 0]
4    [22, 6, 22]
Name: text, dtype: object

If you need id_num, use iloc:

id_num = df.groupby('name')['id'].apply(lambda x: x.iloc[0])
print (id_num)
name
a    1
b    2
c    3
d    4
Name: id, dtype: int64
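As a possible alternative to the iloc lambda, GroupBy.first() returns the first non-null value of each selected column per group, so id and age can be fetched in one call. This is only a sketch: the frame below repeats the question's name, id and age columns without the text, and first_rows is a name chosen for this illustration.

```python
import pandas as pd

# The question's name/id/age columns (text omitted for brevity).
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'] * 3,
    'id': [1, 2, 3, 4] * 3,
    'age': [1, 2, 3, 4] * 3,
})

# One row per name, holding the first id and age seen in that group.
first_rows = df.groupby('name')[['id', 'age']].first()
print(first_rows)
```

Since id and age are constant within each name group, taking the first value loses nothing here.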

But it looks like all you really need is:

df['position'] = df['text'].str.find('I')

print (df)
    age  id name                                       text  position
0     1   1    a                  very good, and I like him        15
1     2   2    b         I play basketball with his brother         0
2     3   3    c                      I hope to get a offer         0
3     4   4    d              everything goes well, I think        22
4     1   1    a                         I will visit china         0
5     2   2    b  no one can understand me, I will solve it        26
6     3   3    c                           I like followers         0
7     4   4    d                       maybe I will be good         6
8     1   1    a   I should work hard to finish my research         0
9     2   2    b   water is the source of earth, I agree it        30
10    3   3    c       I hope you can keep in touch with me         0
11    4   4    d           My baby is very cute, I like him        22
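If the goal is literally to walk over each name group and collect its id, age and the positions of 'I', iterating over the groupby object is one way to sketch it. The dictionary layout and the name per_group are assumptions made for this illustration, not something the question prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'] * 3,
    'id': [1, 2, 3, 4] * 3,
    'age': [1, 2, 3, 4] * 3,
    'text': ['very good, and I like him',
             'I play basketball with his brother',
             'I hope to get a offer',
             'everything goes well, I think',
             'I will visit china',
             'no one can understand me, I will solve it',
             'I like followers',
             'maybe I will be good',
             'I should work hard to finish my research',
             'water is the source of earth, I agree it',
             'I hope you can keep in touch with me',
             'My baby is very cute, I like him'],
})

per_group = {}
# Iterating a groupby yields (group key, sub-DataFrame) pairs.
for name, group in df.groupby('name'):
    per_group[name] = {
        'id': int(group['id'].iloc[0]),     # constant within the group
        'age': int(group['age'].iloc[0]),   # constant within the group
        'positions': group['text'].str.find('I').tolist(),
    }

print(per_group['a'])
```

For large data the vectorized df['text'].str.find('I') column is likely the cheaper route; the loop is mainly useful when each group needs custom handling.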