Question

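I am trying to organize a dataframe by splitting the 'text' column on commas, for example: transcript_dataframe['text'].str.split(pat = delimiter, expand = True). However, on a dataframe with a few million rows this is extremely slow. I would like to know whether there is a faster way, and whether it is possible to wrap a tqdm progress bar around this method to see the progress.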
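One possible direction, sketched under assumptions: splitting each string with plain Python inside a list comprehension and building the frame once is often reported to be faster than `.str.split(expand=True)`, though that should be timed on the real data. It also gives an iterable that tqdm can wrap. The helper name `split_text_column` and the sample data are hypothetical, and the fallback `tqdm` stub only exists so the sketch runs without the package installed.

```python
import pandas as pd

try:
    from tqdm import tqdm  # optional: shows a progress bar over the rows
except ImportError:
    def tqdm(iterable, **kwargs):  # fallback stub: iterate without a bar
        return iterable


def split_text_column(df, column="text", delimiter=","):
    """Split df[column] on delimiter in one pass over the rows (hypothetical helper)."""
    rows = [
        value.split(delimiter) if isinstance(value, str) else [value]
        for value in tqdm(df[column], total=len(df))
    ]
    # Shorter rows are padded with missing values, much like expand=True.
    return pd.DataFrame(rows, index=df.index)


# Made-up sample data, only to show the shape of the result.
transcript_dataframe = pd.DataFrame({"text": ["a,b,c", "d,e", "f"]})
parts = split_text_column(transcript_dataframe)
```

Here `parts` has one column per field and keeps the original index, so it can be joined back to `transcript_dataframe` if needed.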

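Also, since I am going through a few million rows, you can see that I use apply about four times (which means I go through all of those rows four times). Is there a way to do all of this processing in a single loop? The output I want is a dataframe with the following: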

RecordID (string, removed BOM)

Content (string with blank or pipe characters removed)

call_time_seconds (end time - call time, after converting to float, np.nan if error)

count_calls (just 1 throughout)
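The four columns listed above could be produced with vectorized column operations instead of repeated apply calls. This is a minimal sketch, not the original code: the input column names `call_time` and `end_time`, the helper name `tidy_transcripts`, and the sample rows are assumptions, and "blank or pipe characters removed" is read here as "delete pipes, then trim surrounding whitespace".

```python
import pandas as pd


def tidy_transcripts(raw):
    """Build the four output columns from a raw frame (hypothetical input layout)."""
    out = pd.DataFrame(index=raw.index)

    # RecordID: cast to string and drop the byte-order mark if present.
    out["RecordID"] = (
        raw["RecordID"].astype(str).str.replace("\ufeff", "", regex=False)
    )

    # Content: assumed reading of the requirement, delete pipes and trim blanks.
    out["Content"] = (
        raw["Content"].astype(str).str.replace("|", "", regex=False).str.strip()
    )

    # call_time_seconds: values that cannot be parsed as numbers become NaN.
    end = pd.to_numeric(raw["end_time"], errors="coerce")
    start = pd.to_numeric(raw["call_time"], errors="coerce")
    out["call_time_seconds"] = end - start

    # count_calls: constant 1 on every row.
    out["count_calls"] = 1
    return out


# Made-up rows: one clean record and one with an unparseable call time.
raw = pd.DataFrame(
    {
        "RecordID": ["\ufeffA1", "M2"],
        "Content": [" hi|there ", "ok"],
        "call_time": ["1.5", "x"],
        "end_time": ["4.0", "9"],
    }
)
tidy = tidy_transcripts(raw)
```

Each step still scans its column once, but the scans run inside pandas rather than calling a Python lambda per row.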

Finally, I want to remove every row whose 'RecordID' contains 'M'.

Here is my code:

transcripts_df = transcripts_df[transcripts_df['RecordID'].progress_apply(lambda x: bool(re.match('M', str(x)))) != True]
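Note that `re.match` only matches at the start of the string, so the line above drops IDs that begin with 'M' rather than IDs that contain it anywhere. Assuming that is the intended behavior, a vectorized equivalent might look like the sketch below; the sample IDs are made up.

```python
import re

import pandas as pd

# Made-up IDs: one starts with "M", one has "M" in the middle.
transcripts_df = pd.DataFrame({"RecordID": ["M100", "A200", "XM3"]})

# Vectorized test for a leading "M"; no per-row lambda or regex needed.
starts_with_m = transcripts_df["RecordID"].astype(str).str.startswith("M")
filtered = transcripts_df[~starts_with_m]

# The apply-based mask from the question, kept only for comparison.
original_mask = transcripts_df["RecordID"].apply(
    lambda x: bool(re.match("M", str(x)))
)
```

If the goal really is "contains 'M' anywhere", `.str.contains("M", regex=False)` would be the mask to negate instead.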

Any help is appreciated, thanks.

Pandas DataFrame: efficiently applying different functions to multiple columns

0 answers: