Question

Below is the code I use to filter out any Spanish text:

from langdetect import detect  # detects which language a string is written in
from tqdm import tqdm  # progress bar for the loop

# 'summary_processed' is a list of sentence strings that have already had general
# text preprocessing applied (lemmatization, regex removal, lowercasing, etc.)
summary_processed_en = [i for i in tqdm(summary_processed) if detect(i) == 'en']

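Now, this is not a typical conditional, so I can't use the usual df[df == "X"] pattern.

I'm not entirely sure how to approach this. Any help would be greatly appreciated.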

Answer 1

You can do this quite easily with apply and a lambda.

index = df['a'].apply(lambda x: detect(x) == 'en')

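You can then apply that boolean index to whichever column you want, for example df[index] to keep only the English rows. Or you can do

df = df[df['a'].apply(lambda x: detect(x) == 'en')]

to filter on the same column in one step.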
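
One practical caveat: as far as I know, langdetect raises an exception on strings it cannot extract features from (empty strings, digits only, and so on), and missing values will also break a bare lambda. Here is a minimal sketch of a more defensive version of the same boolean-mask idea. The helper names is_english and keep_english_rows and the detect_fn parameter are made up for illustration; they are not part of pandas or langdetect.

```python
import pandas as pd


def is_english(text, detect_fn):
    """Return True only if detect_fn labels `text` as 'en'."""
    # Missing values and blank strings cannot be classified, so drop them.
    if not isinstance(text, str) or not text.strip():
        return False
    try:
        return detect_fn(text) == "en"
    except Exception:
        # Assumption: the detector raises on text it cannot classify;
        # treat those rows as non-English instead of crashing.
        return False


def keep_english_rows(df, column, detect_fn):
    """Return a copy of df with only the rows whose `column` is English."""
    mask = df[column].apply(lambda text: is_english(text, detect_fn))
    return df[mask].reset_index(drop=True)
```

With langdetect installed, this would be called as keep_english_rows(df, 'a', detect) after from langdetect import detect. Passing the detector in as an argument is just a design choice that makes the filter easy to try with a stand-in function.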

Loop through a DataFrame column to remove rows containing Spanish text
