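I'm fairly new to this and am trying to split up some data from a CSV file in Python. I want to parse the data and start a new row whenever a specific delimiter occurs. The delimiters are '.', ';' and '#'. COL_C contains no spaces. The order of the delimiters does not matter: whenever one of them is found, a new row should be created.
Here is the example data: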
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi.Can;You#Help
The output I want is:
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi
Hello | World | Can
Hello | World | You
Hello | World | Help
Example 2:
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi#123;move
New | line | Can.I#parse;this.data
The output I want is:
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi
Hello | World | 123
Hello | World | move
New | Line | Can
New | Line | I
New | Line | parse
New | Line | this
New | Line | data
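If the dataset contains another row whose first two columns are not Hello World (for example, World Hello), I want that row displayed as well, with its third-column data split into new rows in the same way.
Thanks!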
Answer 0 (score: 5)
Setup
import pandas as pd

df = pd.DataFrame({'COL_A': {0: 'Hello ', 1: 'New '},
                   'COL_B': {0: ' World ', 1: ' line '},
                   'COL_C': {0: ' Hi#123;move', 1: ' Can.I#parse;this.data '}})
Out[480]:
COL_A COL_B COL_C
0 Hello World Hi#123;move
1 New line Can.I#parse;this.data
Solution
# split COL_C on any of the given delimiters and stack the pieces into a Series
COL_C2 = df.COL_C.str.split(r'\.|;|#', expand=True).stack()
# join the new Series (after setting its name and index) back to the DataFrame
df.join(pd.Series(index=COL_C2.index.droplevel(1), data=COL_C2.values, name='COL_C2'))
Out[475]:
COL_A COL_B COL_C COL_C2
0 Hello World Hi#123;move Hi
0 Hello World Hi#123;move 123
0 Hello World Hi#123;move move
1 New line Can.I#parse;this.data Can
1 New line Can.I#parse;this.data I
1 New line Can.I#parse;this.data parse
1 New line Can.I#parse;this.data this
1 New line Can.I#parse;this.data data
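The split-and-stack idea above can also be sketched with DataFrame.explode, which is not part of the original answer. This assumes pandas 1.4 or later (for the regex keyword of str.split; explode itself arrived in 0.25 and its ignore_index option in 1.1). The sample frame below is the question's second example with the stray spaces removed, which is an assumption made for readability.

```python
import pandas as pd

# Question's second example, without the padding spaces (assumption).
df = pd.DataFrame({
    "COL_A": ["Hello", "New"],
    "COL_B": ["World", "line"],
    "COL_C": ["Hi#123;move", "Can.I#parse;this.data"],
})

# Split on any single '.', ';' or '#'; each cell becomes a list of pieces.
parts = df["COL_C"].str.split(r"[.;#]", regex=True)

# explode() emits one row per list element and repeats the other columns.
result = df.assign(COL_C=parts).explode("COL_C", ignore_index=True)
print(result)
```

Unlike the join above, this overwrites COL_C rather than adding a COL_C2 column; keep the original column under another name first if you need both.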
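Answer 1 (score: 4)
A blend of speed and elegance
import numpy as np
import pandas as pd

def pir(df, c):
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]
    # one row per split piece, indexed by the row it came from
    cdf = pd.DataFrame({c: np.concatenate(clst)}, df.index.repeat(lens))
    # drop(c, 1) relied on a positional axis argument that pandas 2.0 removed
    return df.drop(columns=c).join(cdf).reset_index(drop=True)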
Forget elegance, give me speed!
def pir2(df, c):
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]
    j = df.columns.get_loc(c)
    v = df.values
    n, m = v.shape
    # row positions, each repeated once per split piece
    r = np.arange(n).repeat(lens)
    return pd.DataFrame(
        np.column_stack([v[r, 0:j], np.concatenate(clst), v[r, j+1:]]),
        columns=df.columns
    )
pir(df, 'COL_C')
# pir2(df, 'COL_C')
COL_A COL_B COL_C
0 Hello World Hi
1 Hello World 123
2 Hello World move
3 New line Can
4 New line I
5 New line parse
6 New line this
7 New line data
Timing
%timeit pir(df, 'COL_C')
1000 loops, best of 3: 1.42 ms per loop
%timeit pir2(df, 'COL_C')
1000 loops, best of 3: 278 µs per loop
%timeit split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')), lst_cols='COL_C')
100 loops, best of 3: 4.16 ms per loop
%%timeit
COL_C2 = df.COL_C.str.split(r'\.|;|#').apply(pd.Series).stack()
df.drop(columns='COL_C').join(pd.Series(index=COL_C2.index.droplevel(1), data=COL_C2.values, name='COL_C')).reset_index(drop=True)
100 loops, best of 3: 2.81 ms per loop
Setup
from io import StringIO
import pandas as pd
txt = """COL_A | COL_B | COL_C
Hello | World | Hi#123;move
New | line | Can.I#parse;this.data """
df = pd.read_csv(StringIO(txt), sep=r'\s*\|\s*', engine='python')
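One detail of this setup may be worth checking. The separator regex consumes whitespace around each pipe, but the space before the closing quotes on the last line of txt is not followed by a pipe, so it may stay in COL_C and end up in the last split piece. A minimal sketch, assuming you want the pieces without surrounding whitespace, strips the column before splitting:

```python
from io import StringIO

import pandas as pd

txt = """COL_A | COL_B | COL_C
Hello | World | Hi#123;move
New | line | Can.I#parse;this.data """

df = pd.read_csv(StringIO(txt), sep=r'\s*\|\s*', engine='python')

# Remove leading/trailing whitespace so no split piece carries a stray space.
df['COL_C'] = df['COL_C'].str.strip()
print(df['COL_C'].tolist())
```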
Answer 2 (score: 2)
Example 1
In [107]: df
Out[107]:
COL_A COL_B COL_C
0 Hello World Hi.Can;You#Help
Solution:
import numpy as np
import pandas as pd

def split_list_in_cols_to_rows(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # repeat the non-list columns once per list element, then flatten the lists
    exploded = pd.DataFrame({
        col: np.repeat(df[col].values, lens)
        for col in idx_cols
    }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it here
    return pd.concat([exploded, df.loc[lens == 0, idx_cols]]) \
        .fillna(fill_value) \
        .loc[:, df.columns]
In [106]: split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')),
lst_cols='COL_C')
Out[106]:
COL_A COL_B COL_C
0 Hello World Hi
1 Hello World Can
2 Hello World You
3 Hello World Help
Example 2:
In [110]: df
Out[110]:
COL_A COL_B COL_C
0 Hello World Hi#123;move
1 New line Can.I#parse;this.data
In [111]: split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')),
...: lst_cols='COL_C')
Out[111]:
COL_A COL_B COL_C
0 Hello World Hi
1 Hello World 123
2 Hello World move
3 New line Can
4 New line I
5 New line parse
6 New line this
7 New line data
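If pandas is not available, the same row expansion can be sketched with only the standard library. This is not taken from any of the answers above. The in-memory string raw is a hypothetical stand-in for the real CSV file, and the sketch assumes pipe-separated fields with no padding spaces and that the column to split is the last one.

```python
import csv
import io
import re

# Hypothetical stand-in for the real file; layout follows the question.
raw = """COL_A|COL_B|COL_C
Hello|World|Hi#123;move
New|line|Can.I#parse;this.data
"""

# Any single '.', ';' or '#' starts a new piece.
DELIMS = re.compile(r"[.;#]")


def split_rows(lines):
    """Yield the header, then one row per delimited piece of the last column."""
    reader = csv.reader(lines, delimiter="|")
    yield next(reader)  # header row passes through unchanged
    for row in reader:
        if not row:  # skip blank lines
            continue
        *lead, last = row
        for piece in DELIMS.split(last):
            yield [*lead, piece]


rows = list(split_rows(io.StringIO(raw)))
for row in rows:
    print(" | ".join(row))
```

To read from disk instead, pass an open file object (opened with newline='') to split_rows in place of the StringIO wrapper.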