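I have a dataframe like the one below, and I want to keep only the rows whose string is at least 5 characters long. Here is what I tried, but it removes "of", "U.", "and", "Arts", etc. from inside the strings instead. I only need to drop the rows whose length is less than 5.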
id schools
1 University of Hawaii
2 Dept in Colorado U.
3 Dept
4 College of Arts and Science
5 Dept
6 Bldg
My code produces this wrong output:
0 University Hawaii
1 Colorado
2
3 College Science
4
5
I am looking for output like this:
id schools
1 University of Hawaii
2 Dept in Colorado U.
4 College of Arts and Science
Code:
import pandas as pd
l = [1,2,3,4,5,6]
s = ['University of Hawaii', 'Dept in Colorado U.','Dept','College of Arts and Science','Dept','Bldg']
df1 = pd.DataFrame({'id':l, 'schools':s})
df1 = df1['schools'].str.findall('\w{5,}').str.join(' ') # not working: drops short words inside each string, not short rows
df1
Answer 0 (score: 2)
Using a regex for this task is huge (and slow) overkill. You can use simple pandas boolean indexing:
filtered_df = df1[df1['schools'].str.len() > 5] # or >= depending on the required logic
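For reference, here is a minimal self-contained sketch of that indexing on the sample data from the question. It assumes the `>= 5` reading of the requirement ("drop rows whose length is less than 5"); the intermediate name `lengths` is only for illustration. On this data, `> 5` would select the same rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'schools': ['University of Hawaii', 'Dept in Colorado U.', 'Dept',
                'College of Arts and Science', 'Dept', 'Bldg'],
})

# .str.len() measures each whole string, giving one value per row,
# whereas findall(r'\w{5,}') looks at individual words inside each string.
lengths = df1['schools'].str.len()

# Boolean indexing: keep the rows where the condition is True.
# Assumption: "at least 5 characters" is the intended rule.
filtered_df = df1[lengths >= 5]
print(filtered_df)
```

The original `id` values are preserved, so the result keeps ids 1, 2 and 4.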
Answer 1 (score: 0)
A simpler approach is to build a boolean filter for your data.
mask = df1['schools'].str.len() > 5
Then create a new dataframe from the filter:
df2 = df1[mask].copy()
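A short sketch of why the `.copy()` may be worth keeping: it makes `df2` an independent dataframe, so later changes to it do not touch `df1`, and it avoids the `SettingWithCopyWarning` that some pandas versions emit when a filtered result is modified. The column name `n_chars` is hypothetical, added only to show a modification.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'schools': ['University of Hawaii', 'Dept in Colorado U.', 'Dept',
                'College of Arts and Science', 'Dept', 'Bldg'],
})

# One True/False value per row
mask = df1['schools'].str.len() > 5

# Explicit copy: df2 is independent of df1
df2 = df1[mask].copy()

# Hypothetical extra column; it is added to df2 only
df2['n_chars'] = df2['schools'].str.len()

print(df2)
print(df1.columns.tolist())
```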
Answer 2 (score: -1)
import pandas as pd
name = ['University of Hawaii', 'Dept in Colorado U.', 'Dept', 'College of Arts and Science', 'Dept', 'Bldg']
labels = ['schools']
df = pd.DataFrame.from_records([[i] for i in name], columns=labels)
df[df['schools'].str.len() > 5]  # keep rows whose string is longer than 5 characters