Question

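I have a pandas DataFrame in which one column contains strings. I want to split that column into an unknown number of columns based on word count.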

Suppose I have a DataFrame df:

Index        Text
0          He codes
1          He codes well in python
2          Python is great language
3          Pandas package is very handy

Now I want to split the Text column into multiple columns, each containing 2 words.

Index         0                 1                 2
0          He codes          NaN               NaN
1          He codes          well in           python
2          Python is         great language    NaN
3          Pandas package    is very           handy

How can I do this in Python? Please help. Thanks in advance.

Answer 1

Given a DataFrame df with a Text column, we need to split each sentence into chunks of two words:

import pandas as pd

def splitter(s):
    spl = s.split()
    return [" ".join(spl[i:i+2]) for i in range(0, len(spl), 2)]

df_new = pd.DataFrame(df["Text"].apply(splitter).to_list())

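# Result for the question's df (empty cells may print as None or NaN,
# depending on the pandas version):
#                 0               1       2
# 0        He codes            None    None
# 1        He codes         well in  python
# 2       Python is  great language    None
# 3  Pandas package         is very   handy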
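
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea with the chunk size as a parameter. The names chunk_words and n are made up for illustration and are not part of pandas; the sample data is the df from the question, and every row is assumed to hold a string.

```python
import pandas as pd

def chunk_words(text, n=2):
    """Split text on whitespace and rejoin the words in groups of n."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

# Sample data copied from the question.
df = pd.DataFrame({
    "Text": [
        "He codes",
        "He codes well in python",
        "Python is great language",
        "Pandas package is very handy",
    ]
})

# Each row becomes a list of chunks; building a DataFrame from the
# ragged lists pads the shorter rows with missing values.
chunks = df["Text"].apply(chunk_words, n=2)
df_new = pd.DataFrame(chunks.tolist(), index=df.index)
print(df_new)
```

Passing index=df.index keeps the original row labels, which matters if df does not use the default 0..n-1 index.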

Answer 2

IIUC, we can use str.split, then groupby with cumcount and floor division to number the word pairs, and finally unstack:

s = (
    df["Text"]
    .str.split(r"\s+", expand=True)  # raw string avoids the invalid-escape warning
    .stack()
    .to_frame("words")
    .reset_index(1, drop=True)
)
s["count"] = s.groupby(level=0).cumcount() // 2
final = s.rename_axis("idx").groupby(["idx", "count"])["words"].agg(" ".join).unstack(1)

print(final)

count               0               1       2
idx                                          
0            He codes             NaN     NaN
1            He codes         well in  python
2           Python is  great language     NaN
3      Pandas package         is very   handy
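
A variant of the same approach, sketched with Series.explode instead of split(expand=True) plus stack, so no padding values are ever created and none have to be dropped again. This is only one possible way to write it; the names tmp, row and chunk are arbitrary, and it assumes every row of Text is a non-empty string.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Text": [
        "He codes",
        "He codes well in python",
        "Python is great language",
        "Pandas package is very handy",
    ]
})

# One word per row; the original row label is repeated in the index.
words = df["Text"].str.split().explode()

# Number the words within each original row, then pair them up:
# words 0 and 1 -> chunk 0, words 2 and 3 -> chunk 1, and so on.
chunk_id = words.groupby(level=0).cumcount() // 2

# Put everything in flat columns so the grouping keys are explicit.
tmp = pd.DataFrame({
    "row": words.index,
    "chunk": chunk_id.to_numpy(),
    "word": words.to_numpy(),
})

# Join the words of each (row, chunk) pair and pivot chunks into columns.
final = tmp.groupby(["row", "chunk"])["word"].agg(" ".join).unstack("chunk")
print(final)
```

Changing the divisor in cumcount() // 2 changes how many words go into each column.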
