Question

I have a DataFrame named df that looks like this:

  name   id    age             text 
   a      1     1    very good, and I like him
   b      2     2    I play basketball with his brother
   c      3     3    I hope to get a offer
   d      4     4    everything goes well, I think
   a      1     1    I will visit china
   b      2     2    no one can understand me, I will solve it
   c      3     3    I like followers
   d      4     4    maybe I will be good
   a      1     1    I should work hard to finish my research
   b      2     2    water is the source of earth, I agree it
   c      3     3    I hope you can keep in touch with me
   d      4     4    My baby is very cute, I like him

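As you can see, there are four names: a, b, c, d. Each name has an id, an age, and a text. Within each name group the id and age are the same, but the texts differ, and each name has three rows (this is only an example; the real data is much larger).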

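I want to get the id and age of each name group. In addition, I want to use a function, extract_text(text), to compute the index of a character in every text of each group. That is, I want to get data like the following, taking name 'a' as an example: age: 1, id: 1, and the indices of 'I' in its three rows (made-up values, just for illustration): 20, 0, 0.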

I tried the following:

import pandas as pd

def extract_text(text):
    index_n = None
    text_len = len(text)
    for i in range(0, text_len, 1):
        if text[i] == 'I':
            index_n = i
    return index_n


df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd',
                            'a', 'b', 'c', 'd'],
                   'id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   'age': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   'text': ['very good, and I like him',
                            'I play basketball with his brother',
                            'I hope to get a offer',
                            'everything goes well, I think',
                            'I will visit china',
                            'no one can understand me, I will solve it',
                            'I like followers', 'maybe I will be good',
                            'I should work hard to finish my research',
                            'water is the source of earth, I agree it',
                            'I hope you can keep in touch with me',
                            'My baby is very cute, I like him']})


id_num = df.groupby('name')['id'].value[0]
id_num = df.groupby('age')['id'].value[0]
index_num = df.groupby('age')['text'].apply(extract_text)

But I get an error:

Traceback (most recent call last):
  File "bot_test_new.py", line 25, in <module>
    id_num = df.groupby('name')['id'].value[0]
AttributeError: 'SeriesGroupBy' object has no attribute 'value'

Please help me, thanks!

Answer 1

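I'll elaborate on my comment. The problem is that extract_text can only handle a single string. However, when you group and then apply, the function receives a collection (a pandas Series) containing all the strings in that group.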
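To make this concrete, here is a minimal sketch that records what the applied function receives. The helper name inspect_group and the four-row frame are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Small illustrative frame (a subset of the question's data).
df = pd.DataFrame({
    'age': [1, 2, 1, 2],
    'text': ['very good, and I like him',
             'I play basketball with his brother',
             'I will visit china',
             'no one can understand me, I will solve it'],
})

received = []  # type names of the objects passed to the function

def inspect_group(group):
    # Hypothetical helper: remember the argument type, return the group size.
    received.append(type(group).__name__)
    return len(group)

sizes = df.groupby('age')['text'].apply(inspect_group)

print(set(received))    # each call gets a whole Series, not a single string
print(sizes.to_dict())  # number of texts per age group
```

Because each call gets a whole Series, indexing it with text[i] looks up row labels rather than characters, which is why the per-string function has to be mapped over the group.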

There are two solutions. The first is the one I indicated (pass the strings one at a time):

index_num = df.groupby('age')['text'].apply(lambda x: [extract_text(_) for _ in x])

The other is to change extract_text so that it can handle a list of strings:

def extract_text(list_texts):
    list_index = []
    for text in list_texts:
        index_n = None
        text_len = len(text)
        for i in range(0, text_len, 1):
            if text[i] == 'I':
                index_n = i
        list_index.append(index_n)
    return list_index

Then proceed as before:

index_num = df.groupby('age')['text'].apply(extract_text)

In addition, you can use text.find("I") in extract_text instead of your loop, like this: def extract_text(list_texts): return [text.find("I") for text in list_texts].
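One caveat when swapping the loop for str.find: the two are not exactly equivalent. The loop keeps overwriting index_n, so it reports the last 'I' and returns None when there is none, while str.find reports the first match and returns -1. The sketch below compares them on made-up strings; last_index_loop is a hypothetical name for the question's loop.

```python
def last_index_loop(text, ch='I'):
    # Same logic as the question's loop: keeps the LAST matching position.
    index_n = None
    for i, c in enumerate(text):
        if c == ch:
            index_n = i
    return index_n

sample = 'I think I agree'  # illustrative string with two matches

loop_result = last_index_loop(sample)   # 8  (last occurrence)
find_result = sample.find('I')          # 0  (first occurrence)
rfind_result = sample.rfind('I')        # 8  (last occurrence, like the loop)

missing_loop = last_index_loop('no match')  # None
missing_find = 'no match'.find('I')         # -1

print(loop_result, find_result, rfind_result, missing_loop, missing_find)
```

For the texts in the question each string contains a single 'I', so both approaches give the same positions there.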

Answer 2

I think you can use str.find:

print (df.groupby('age')['text'].apply(lambda x: x.str.find('I').tolist()))
age
1     [15, 0, 0]
2    [0, 26, 30]
3      [0, 0, 0]
4    [22, 6, 22]
Name: text, dtype: object

If you need id_num, use iloc:

id_num = df.groupby('name')['id'].apply(lambda x: x.iloc[0])
print (id_num)
name
a    1
b    2
c    3
d    4
Name: id, dtype: int64
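The AttributeError in the question arises because a SeriesGroupBy object has no value attribute. As a possible alternative to the iloc lambda, the built-in GroupBy.first() can fetch id and age in one call. This is a sketch on a reduced frame, not the answerer's code. Note that first() returns the first non-null value per group, which matches iloc[0] only when there are no missing values.

```python
import pandas as pd

# Reduced illustrative frame: two names, two rows each.
df = pd.DataFrame({
    'name': ['a', 'b', 'a', 'b'],
    'id':   [1, 2, 1, 2],
    'age':  [1, 2, 1, 2],
})

# One row per name, holding the first id and age seen in that group.
first_vals = df.groupby('name')[['id', 'age']].first()
print(first_vals)
```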

But it looks like you can simply use:

df['position'] = df['text'].str.find('I')

print (df)
    age  id name                                       text  position
0     1   1    a                  very good, and I like him        15
1     2   2    b         I play basketball with his brother         0
2     3   3    c                      I hope to get a offer         0
3     4   4    d              everything goes well, I think        22
4     1   1    a                         I will visit china         0
5     2   2    b  no one can understand me, I will solve it        26
6     3   3    c                           I like followers         0
7     4   4    d                       maybe I will be good         6
8     1   1    a   I should work hard to finish my research         0
9     2   2    b   water is the source of earth, I agree it        30
10    3   3    c       I hope you can keep in touch with me         0
11    4   4    d           My baby is very cute, I like him        22
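If the goal is one row per name with its id, age, and the list of 'I' positions, the pieces above can be combined. The following is only a sketch of one way to do that; the variable name summary and the column name positions are assumptions, not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'] * 3,
    'id':   [1, 2, 3, 4] * 3,
    'age':  [1, 2, 3, 4] * 3,
    'text': ['very good, and I like him',
             'I play basketball with his brother',
             'I hope to get a offer',
             'everything goes well, I think',
             'I will visit china',
             'no one can understand me, I will solve it',
             'I like followers', 'maybe I will be good',
             'I should work hard to finish my research',
             'water is the source of earth, I agree it',
             'I hope you can keep in touch with me',
             'My baby is very cute, I like him'],
})

# id and age are constant within a name, so the first value is enough.
summary = df.groupby('name')[['id', 'age']].first()

# Positions of 'I' for every text of the group, collected into a list.
summary['positions'] = (
    df.groupby('name')['text'].apply(lambda s: s.str.find('I').tolist())
)

print(summary)
```

Both results are indexed by name, so the column assignment aligns them automatically.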

Python pandas: group a DataFrame by a column (e.g., name) and get each group
