Question

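I have a Pandas DataFrame with several columns, and the "input_text" column holds about 8K words per row. I want to split each row into multiple rows, each containing 500 words from the original row's input_text. For example, using 2 words instead of 500, this row: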

a | b | c | d | input_text
1   2   3   4       'Hello this is text hello how r u'

would become 4 rows:

a | b | c | d | input_text
1   2   3   4       'Hello this'
1   2   3   4       'is text'
1   2   3   4       'hello how'
1   2   3   4       'r u'

But I need this to work with 500 words per row.

Code:

import pandas as pd
df = pd.read_csv('data.csv')
# function

Note: the DataFrame I am working with is really large, so speed matters here.

Answer 1

Setup

print(df)

   a  b  c  d                        input_text
0  1  2  3  4  Hello this is text hello how r u

Approach using `findall` and `explode`

df['input_text'] = df['input_text'].str.findall(r'((?:\S+\s?){1,2})(?:\s|$)')
df = df.explode('input_text')

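Regex details

((?:\S+\s?){1,2}): first capturing group
- (?:\S+\s?): non-capturing group
  - \S+\s?: matches one or more non-whitespace characters followed by zero or one whitespace character
  - {1,2}: matches the previous token between one and two times
(?:\s|$): non-capturing group
- \s|$: matches a single whitespace character or asserts the position at the end of the string

See the online regex demo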
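
To see the pattern at work outside of pandas, here is a minimal sketch using only the standard `re` module on the sample sentence from the question. The variable names `pattern`, `text`, and `chunks` are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer: up to 2 words per chunk,
# each chunk followed by whitespace or the end of the string
pattern = re.compile(r'((?:\S+\s?){1,2})(?:\s|$)')

text = 'Hello this is text hello how r u'

# With a single capturing group, findall returns the captured chunks
chunks = pattern.findall(text)
print(chunks)  # ['Hello this', 'is text', 'hello how', 'r u']
```

`Series.str.findall` applies the same matching to every row, which is why each cell ends up holding a list that `explode` can then spread over rows.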

Result

print(df)

   a  b  c  d  input_text
0  1  2  3  4  Hello this
0  1  2  3  4     is text
0  1  2  3  4   hello how
0  1  2  3  4         r u

Note: to split into chunks of 500 words, replace the 2 in the regex pattern with 500.
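
Rather than editing the pattern by hand, the chunk size could be passed in as a parameter. The following is a rough sketch under that assumption; the helper name `split_rows_by_words` and its arguments are hypothetical and do not come from the original answer.

```python
import pandas as pd

def split_rows_by_words(df, column, n):
    """Return a copy of df with `column` split into rows of at most n words."""
    # Doubled braces are literal braces in an f-string, so n=2 yields {1,2}
    pattern = rf'((?:\S+\s?){{1,{n}}})(?:\s|$)'
    out = df.copy()
    out[column] = out[column].str.findall(pattern)
    # explode repeats the other columns once per chunk
    return out.explode(column)

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4],
                   'input_text': ['Hello this is text hello how r u']})
result = split_rows_by_words(df, 'input_text', 2)
print(result)
```

For the real data this would be called with `n=500`. The pattern expects words to be separated by a single whitespace character; a run of several whitespace characters ends a chunk early, so normalizing whitespace first may be worthwhile.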

Answer 2

Try this. First split the text into groups of words and store the resulting list in a new column. You can change number_to_split to 500 to get 500-word chunks.

number_to_split = 2

def split_text(string):
    # Split on whitespace, then re-join the words in groups of number_to_split
    words = string.split()
    grouped_words = [' '.join(words[i: i + number_to_split]) for i in range(0, len(words), number_to_split)]
    return grouped_words

df['new_col'] = df['input_text'].apply(split_text)

Then repeat the row once for each value in the list, like this:

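df_new = df.new_col.apply(pd.Series).stack().rename('new').reset_index()
merged = pd.merge(df_new, df, left_on='level_0', right_index=True, suffixes=('', '_old'))
# Keep the original columns, with the chunk column 'new' replacing the full text and the list column
merged[df.columns.drop(['input_text', 'new_col']).tolist() + ['new']]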
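
As a possible shortcut, the stack-and-merge step could be swapped for `DataFrame.explode`, which repeats the other columns automatically. This is a sketch of that alternative, not the method this answer uses, and the sample DataFrame is made up to mirror the question.

```python
import pandas as pd

number_to_split = 2

def split_text(string):
    # Group the whitespace-separated words into joined chunks
    words = string.split()
    return [' '.join(words[i:i + number_to_split])
            for i in range(0, len(words), number_to_split)]

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4],
                   'input_text': ['Hello this is text hello how r u']})

# Replace the text with its list of chunks, then emit one row per chunk
df_new = (df.assign(input_text=df['input_text'].apply(split_text))
            .explode('input_text')
            .reset_index(drop=True))
print(df_new)
```

This avoids `apply(pd.Series)`, which builds one Series object per row and tends to be slow on large frames.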

Answer 3

Try:

def splitter(input_text, n=2):
    # Split on whitespace and group the words into lists of n
    values = input_text.split()
    return [values[i:i + n] for i in range(0, len(values), n)]

df['input_text'] = df['input_text'].astype(str).apply(splitter)
df = df.explode('input_text')  # one row per group of words
df['input_text'] = df['input_text'].apply(' '.join)  # back to plain strings
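
One edge case worth guarding against: if a row's text contains no words, `splitter` returns an empty list, `explode` turns that into NaN, and the final join then fails. The sketch below shows one way to handle it, assuming such rows should be kept with an empty string; that choice and the sample data are illustrative.

```python
import pandas as pd

def splitter(input_text, n=2):
    values = input_text.split()
    return [values[i:i + n] for i in range(0, len(values), n)]

# Second row has empty text to exercise the edge case
df = pd.DataFrame({'a': [1, 5], 'input_text': ['Hello this is', '']})

df['input_text'] = df['input_text'].astype(str).apply(splitter)
df = df.explode('input_text')

# After explode, each cell is a list of words, or NaN where the list was empty
df['input_text'] = df['input_text'].apply(
    lambda x: ' '.join(x) if isinstance(x, list) else ''
)
print(df)
```

Dropping those rows instead, for example with `dropna(subset=['input_text'])` before the join, would be an equally reasonable choice.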
