I have a DataFrame named df that looks like this:
name id age text
a 1 1 very good, and I like him
b 2 2 I play basketball with his brother
c 3 3 I hope to get a offer
d 4 4 everything goes well, I think
a 1 1 I will visit china
b 2 2 no one can understand me, I will solve it
c 3 3 I like followers
d 4 4 maybe I will be good
a 1 1 I should work hard to finish my research
b 2 2 water is the source of earth, I agree it
c 3 3 I hope you can keep in touch with me
d 4 4 My baby is very cute, I like him
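As you can see, there are four names: a, b, c, d. Each name has an id, an age, and a text. Within each name group the id and age are always the same, but the text differs from row to row. Each name has three rows (this is only an example; the real data is much larger).
I want to get the id and age for each name group. In addition, I want to use a function, extract_text(text), to compute the index of a character in every text of each group. That is, I want data like the following, taking name 'a' as an example: age: 1, id: 1, and the indices of 'I' in its three rows (these numbers are only an illustration, not the real values): 20, 0, 0
I tried the following: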
import pandas as pd
def extract_text(text):
    index_n = None
    text_len = len(text)
    for i in range(0, text_len, 1):
        if text[i] == 'I':
            index_n = i
    return index_n

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd',
                            'a', 'b', 'c', 'd'],
                   'id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   'age': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   'text': ['very good, and I like him',
                            'I play basketball with his brother',
                            'I hope to get a offer',
                            'everything goes well, I think',
                            'I will visit china',
                            'no one can understand me, I will solve it',
                            'I like followers', 'maybe I will be good',
                            'I should work hard to finish my research',
                            'water is the source of earth, I agree it',
                            'I hope you can keep in touch with me',
                            'My baby is very cute, I like him']})
id_num = df.groupby('name')['id'].value[0]
id_num = df.groupby('age')['id'].value[0]
index_num = df.groupby('age')['text'].apply(extract_text)
But I get an error:
Traceback (most recent call last):
  File "bot_test_new.py", line 25, in <module>
    id_num = df.groupby('name')['id'].value[0]
AttributeError: 'SeriesGroupBy' object has no attribute 'value'
Please help me, thanks!
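Answer 0 (score: 1)
I'll elaborate on my comment. The problem is that extract_text can only handle a single string. When you group and then apply, however, the function receives the whole group at once: a Series containing all of that group's strings.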
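To see what the function actually receives, here is a minimal sketch on a four-row sample of the question's data. The helper inspect_group and the list received_types are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Small sample of the question's data (two age groups, two rows each)
df = pd.DataFrame({
    'age': [1, 2, 1, 2],
    'text': ['very good, and I like him',
             'I play basketball with his brother',
             'I will visit china',
             'no one can understand me, I will solve it'],
})

received_types = []  # hypothetical: records the type of each argument


def inspect_group(group):
    # apply passes the whole group, not one string at a time
    received_types.append(type(group).__name__)
    return len(group)


sizes = df.groupby('age')['text'].apply(inspect_group)
print(set(received_types))  # the argument type seen by the function
print(sizes.to_dict())      # number of rows in each group
```

Because each call gets a Series rather than a str, character-by-character indexing such as text[i] inside extract_text does not do what the question intended.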
There are two solutions. The first is the one I suggested (pass the strings in one at a time):
index_num = df.groupby('age')['text'].apply(lambda x: [extract_text(_) for _ in x])
The other is to change extract_text so that it can handle a list of strings:
def extract_text(list_texts):
    list_index = []
    for text in list_texts:
        index_n = None
        text_len = len(text)
        for i in range(0, text_len, 1):
            if text[i] == 'I':
                index_n = i
        list_index.append(index_n)
    return list_index
Then proceed as before:
index_num = df.groupby('age')['text'].apply(extract_text)
Also, you can use text.find("I") in extract_text instead of your loop, like this:
def extract_text(list_texts):
    return [text.find("I") for text in list_texts]
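One caveat when swapping the loop for find: the two are not strictly equivalent. Assuming the question's loop has no break (none appears in the posted code), it keeps overwriting index_n and so returns the last match, or None if there is none. str.find returns the first match, or -1 if there is none, and str.rfind returns the last. The sketch below compares them; last_index_loop is a hypothetical name for the question's loop and sample is a made-up string.

```python
def last_index_loop(text):
    # Same logic as the question's loop: no break, so the last match wins
    index_n = None
    for i in range(len(text)):
        if text[i] == 'I':
            index_n = i
    return index_n


sample = 'I think I can'  # made-up string with two occurrences of 'I'

loop_result = last_index_loop(sample)   # last occurrence
find_result = sample.find('I')          # first occurrence
rfind_result = sample.rfind('I')        # last occurrence

print(loop_result, find_result, rfind_result)            # 8 0 8
print(last_index_loop('no match'), 'no match'.find('I'))  # None -1
```

In the question's data every text contains exactly one 'I', so either approach gives the same indices there.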
Answer 1 (score: 1)
I think you can use str.find:
print (df.groupby('age')['text'].apply(lambda x: x.str.find('I').tolist()))
age
1 [15, 0, 0]
2 [0, 26, 30]
3 [0, 0, 0]
4 [22, 6, 22]
Name: text, dtype: object
If you need id_num, use iloc:
id_num = df.groupby('name')['id'].apply(lambda x: x.iloc[0])
print (id_num)
name
a 1
b 2
c 3
d 4
Name: id, dtype: int64
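Since the question asks for both id and age per name, a possible variation is to take the first row of each group for both columns at once with GroupBy.first(). This is a sketch on a reduced sample, not part of the original answer. The names per_name and is_constant are introduced here for illustration, and the nunique check only confirms the question's claim that id and age are constant within a name.

```python
import pandas as pd

# Reduced sample: two names, two rows each
df = pd.DataFrame({'name': ['a', 'b', 'a', 'b'],
                   'id':   [1, 2, 1, 2],
                   'age':  [1, 2, 1, 2]})

# First id and age seen for each name
per_name = df.groupby('name')[['id', 'age']].first()
print(per_name)

# Optional sanity check: every name has exactly one distinct id and age
is_constant = bool((df.groupby('name')[['id', 'age']].nunique() == 1).all().all())
print(is_constant)
```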
But it looks like all you need is:
df['position'] = df['text'].str.find('I')
print (df)
age id name text position
0 1 1 a very good, and I like him 15
1 2 2 b I play basketball with his brother 0
2 3 3 c I hope to get a offer 0
3 4 4 d everything goes well, I think 22
4 1 1 a I will visit china 0
5 2 2 b no one can understand me, I will solve it 26
6 3 3 c I like followers 0
7 4 4 d maybe I will be good 6
8 1 1 a I should work hard to finish my research 0
9 2 2 b water is the source of earth, I agree it 30
10 3 3 c I hope you can keep in touch with me 0
11 4 4 d My baby is very cute, I like him 22
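If one row per name is still wanted, the position column can be folded back into a per-name summary. The following is a sketch that assumes a pandas version with named aggregation (0.25 or later); the name summary and the output column positions are introduced here and do not come from the original answers.

```python
import pandas as pd

# Reduced sample of the question's data: two names, two rows each
df = pd.DataFrame({
    'name': ['a', 'b', 'a', 'b'],
    'id':   [1, 2, 1, 2],
    'age':  [1, 2, 1, 2],
    'text': ['very good, and I like him',
             'I play basketball with his brother',
             'I will visit china',
             'no one can understand me, I will solve it'],
})

# Vectorised search, as in the answer above
df['position'] = df['text'].str.find('I')

# One row per name: first id, first age, and all positions as a list
summary = df.groupby('name').agg(
    id=('id', 'first'),
    age=('age', 'first'),
    positions=('position', list),
)
print(summary)
```

For name 'a' in this sample the positions come out as [15, 0], matching the first two 'a' rows of the full output above.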