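I have a pandas DataFrame with 99 columns dx1-dx99 and 99 columns px1-px99. The values in these columns are codes 4 to 8 characters long, made up of characters and digits.
I want to keep only those values whose first three characters match one of the strings in a supplied list. Every string in the supplied list is exactly three characters long.
The supplied list is generated dynamically and is very long, so I need to pass the whole list at once rather than writing out each string separately.
For example, I have this DataFrame: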
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar one123 bar foo one324 foo 0'.split(),
                   'B': 'one546 one765 twosde three twowef two234 onedfr three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
        A       B  C   D
0     foo  one546  0   0
1     bar  one765  1   2
2  one123  twosde  2   4
3     bar   three  3   6
4     foo  twowef  4   8
5  one324  two234  5  10
6     foo  onedfr  6  12
7       0   three  7  14
The populated cells are of object dtype, and all the zeros were originally NULLs that I filled with zeros using df.fillna(0).
When I do this:
keep = df.iloc[:,:].isin(['one123','one324','twosde','two234']).values
df.iloc[:,:] = df.iloc[:,:].where(keep, 0)
print(df)
I get:
        A       B  C  D
0       0       0  0  0
1       0       0  0  0
2  one123  twosde  0  0
3       0       0  0  0
4       0       0  0  0
5  one324  two234  0  0
6       0       0  0  0
7       0       0  0  0
But instead of passing the individual strings 'one123', 'one324', 'twosde', 'two234', I want to pass a list of partial strings, like this:
startstrings = ['one', 'two']
keep = df.iloc[:,:].contains(startstrings)
df.iloc[:,:] = df.iloc[:,:].where(keep, 0)
print(df)
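But the above does not work. I want to keep every value that starts with 'one' or 'two'.
Any idea how to implement this? My dataset is large, so efficiency matters.
Answer 0: (score: 3)
pandas str.contains accepts regular expressions, which lets you test for any item in the list. Loop over each column and use str.contains: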
startstrings = ['one', 'two']
# Anchor with ^ so that only values starting with one of the prefixes match
pattern = '^(?:' + '|'.join(startstrings) + ')'
for col in df:
    if all(df[col].apply(type) == str):
        # Set values to 0 if they do not start with any of the prefixes
        df.loc[~df[col].str.contains(pattern), col] = 0
    else:
        # Column is not all strings
        df[col] = 0
Produces:
      A     B  C  D
0     0  one1  0  0
1     0  one1  0  0
2  one1  two1  0  0
3     0     0  0  0
4     0  two1  0  0
5  one1  two1  0  0
6     0  one1  0  0
7     0     0  0  0
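If the prefix list is very long, a large regex alternation may become slow. Since all prefixes here have the same length, one alternative is to slice off the first three characters and test them with isin. This is only a sketch: the function name keep_by_prefix and its width and fill parameters are made up for illustration, and it converts every column to text first, so numeric values are compared by their digits.

```python
import numpy as np
import pandas as pd

def keep_by_prefix(df, startstrings, width=3, fill=0):
    """Keep values whose first `width` characters appear in `startstrings`."""
    prefixes = set(startstrings)  # hypothetical helper; a set gives fast lookups
    out = df.copy()
    for col in out.columns:
        # Compare the text form of each value, so non-string columns are handled too
        heads = out[col].astype(str).str[:width]
        out[col] = out[col].where(heads.isin(prefixes), fill)
    return out

df = pd.DataFrame({'A': 'foo bar one1 bar foo one1 foo foo'.split(),
                   'B': 'one1 one1 two1 three two1 two1 one1 three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
result = keep_by_prefix(df, ['one', 'two'])
print(result)
```

This avoids building a regex at all, at the cost of only supporting prefixes of one fixed length.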
Answer 1: (score: 0)
Here is a vectorized NumPy approach:
import numpy as np

# Adapted from http://stackoverflow.com/a/39045337/3293881
def slicer_vectorized(a, start, end):
    # View the fixed-width unicode array as single characters, slice, then rejoin
    b = a.view('U1').reshape(len(a), -1)[:, start:end]
    return np.ascontiguousarray(b).view('U' + str(end - start)).ravel()

def isin_chars(df, startstrings, start=0, stop=3):
    a = df.values.astype(str)
    ss_arr = np.sort(startstrings)
    a_S3 = slicer_vectorized(a.ravel(), start, stop)
    # Clip so that values sorting after the last prefix do not index out of bounds
    idx = np.searchsorted(ss_arr, a_S3).clip(max=len(ss_arr) - 1)
    mask = (a_S3 == ss_arr[idx]).reshape(a.shape)
    return df.mask(~mask, 0)

def process(df, startstrings, n=100):
    dx_names = ['dx' + str(i) for i in range(1, n)]
    px_names = ['px' + str(i) for i in range(1, n)]
    all_names = np.hstack((dx_names, px_names))
    df0 = df[all_names]
    df_out = isin_chars(df0, startstrings, start=0, stop=3)
    return df_out
Sample run:
In [245]: df
Out[245]:
    dx1    dx2  px1  px2  0
0   foo   one1    0    0  0
1   bar   one1    1    2  7
2  one1   two1    2    4  3
3   bar  three    3    6  8
4   foo   two1    4    8  1
5  one1   two1    5   10  8
6   foo   one1    6   12  6
7   foo  three    7   14  6
In [246]: startstrings = ['two', 'one']
In [247]: process(df, startstrings, n = 3) # change n = 100 for actual case
Out[247]:
    dx1   dx2  px1  px2
0     0  one1    0    0
1     0  one1    0    0
2  one1  two1    0    0
3     0     0    0    0
4     0  two1    0    0
5  one1  two1    0    0
6     0  one1    0    0
7     0     0    0    0
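A shorter variant of the same idea, offered only as a sketch: casting a NumPy string array to a fixed-width dtype such as 'U3' truncates each value to its first three characters, after which np.isin can test membership directly. The function name isin_prefix_numpy and the width and fill parameters are hypothetical, and the cast to str means numeric cells are compared by their digits.

```python
import numpy as np
import pandas as pd

def isin_prefix_numpy(df, startstrings, width=3, fill=0):
    # Convert every cell to text, then truncate to `width` characters
    # by casting to a fixed-width unicode dtype (hypothetical helper)
    text = df.to_numpy().astype(str)
    heads = text.astype('U' + str(width))
    mask = np.isin(heads, list(startstrings))
    return df.where(mask, fill)

df = pd.DataFrame({'dx1': 'foo bar one1 bar foo one1 foo foo'.split(),
                   'dx2': 'one1 one1 two1 three two1 two1 one1 three'.split(),
                   'px1': np.arange(8), 'px2': np.arange(8) * 2})
out = isin_prefix_numpy(df, ['two', 'one'])
print(out)
```

This trades the manual view and searchsorted steps for two casts and one np.isin call; whether it is faster on a large frame would need to be measured.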
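Answer 2: (score: 0)
This is a brute-force approach, but it allows prefix strings of different lengths, as shown. I modified your example to look for ['one1', 'th'] to demonstrate the different lengths. Not sure whether this is what you need.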
The code I ran:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar one1 bar foo one1 foo foo'.split(),
                   'B': 'one1 one1 two1 three two1 two1 one1 three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
prefixes = "one1 th".split()
matches = np.full(df.shape, False, dtype=bool)
for pfx in prefixes:
    for i, col in enumerate(df.columns):
        try:
            matches[:, i] |= df[col].str.startswith(pfx).to_numpy()
        except AttributeError:
            # Non-string columns have no .str accessor; leave them unmatched
            pass
keep = df.where(matches, 0)
print(keep)
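The nested loop over prefixes can probably be collapsed into one pass per column with str.match, which only matches at the start of each string. The sketch below assumes that treating every cell as text is acceptable; keep_prefixed and its fill parameter are names invented here, and re.escape is used so that prefixes containing regex metacharacters are taken literally.

```python
import re
import numpy as np
import pandas as pd

def keep_prefixed(df, prefixes, fill=0):
    # Build one alternation of escaped prefixes; str.match anchors at the start
    pattern = '|'.join(re.escape(p) for p in prefixes)
    # Hypothetical helper: test the text form of every column against the pattern
    mask = df.apply(lambda s: s.astype(str).str.match(pattern))
    return df.where(mask, fill)

df = pd.DataFrame({'A': 'foo bar one1 bar foo one1 foo foo'.split(),
                   'B': 'one1 one1 two1 three two1 two1 one1 three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
kept = keep_prefixed(df, ['one1', 'th'])
print(kept)
```

Like the loop version, this handles prefixes of different lengths, here 'one1' and 'th'.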