Question

I am trying to remove specific words, given in a list, as well as <title> and </title> from a text file.

The words I need to remove are in the list words=[a,is,and,there,here]

My list lines contains the following text:

lines= [<title>The query complexity of estimating weighted averages.</title>', '<title>New bounds for the query complexity of an algorithm that learns DFAs with correction and equivalence queries.</title>', '<title>A general procedure to check conjunctive query containment.</title>]

Please help me remove the words contained in the list as well as the <title> and </title> tags.

Answer 1

Use the re.sub function.

>>> import re
>>> lines= ['<title>The query complexity of estimating weighted averages.</title>', '<title>New bounds for the query complexity of an algorithm that learns DFAs with correction and equivalence queries.</title>', '<title>A general procedure to check conjunctive query containment.</title>']
>>> words=['a','is','and','there','here']
>>> [re.sub(r'</?title>|\b(?:'+'|'.join(words)+r')\b', r'', line) for line in lines]
['The query complexity of estimating weighted averages.', 'New bounds for the query complexity of an algorithm that learns DFAs with correction  equivalence queries.', 'A general procedure to check conjunctive query containment.']

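The \b before and after the words makes the pattern match whole words only. \b is called a word boundary; it matches at the position between a word character and a non-word character. Note that the match is case-sensitive, so the capital "A" in the third title is kept, and removing "and" leaves a double space behind.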
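
To see what the word boundary changes, here is a small sketch that runs the same alternation with and without \b on a made-up sentence. The helper name build_pattern and the sample text are assumptions for illustration, and re.escape is an extra precaution that the answer above does not use.

```python
import re

words = ['a', 'is', 'and', 'there', 'here']
text = 'This is a band and there it is'  # made-up sample sentence

# Hypothetical helper: builds the alternation with or without \b anchors.
def build_pattern(words, whole_words=True):
    # re.escape guards against words that contain regex metacharacters.
    alternation = '|'.join(re.escape(w) for w in words)
    if whole_words:
        return re.compile(r'\b(?:' + alternation + r')\b')
    return re.compile(r'(?:' + alternation + r')')

# With \b only standalone words are removed; "This" and "band" survive.
with_boundary = build_pattern(words).sub('', text)

# Without \b the pattern also eats fragments inside longer words.
without_boundary = build_pattern(words, whole_words=False).sub('', text)

print(with_boundary.split())
print(without_boundary.split())
```

Splitting the results on whitespace makes the comparison easier to read, since removing a word leaves its surrounding spaces in place.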

Answer 2

You can do this more efficiently without regular expressions:

lines = ['<title>The query complexity of estimating weighted averages.</title>',
         '<title>New bounds for the query complexity of an algorithm that learns DFAs with correction and equivalence queries.</title>',
         '<title>A general procedure to check conjunctive query containment.</title>']
words = {"a", "is", "and", "there", "here"}

print([" ".join([w for line in lines
             for w in line[7:-8:].split(" ")
             if w.lower() not in words])])


['The query complexity of estimating weighted averages. New bounds for the query complexity of an algorithm that learns DFAs with correction equivalence queries. general procedure to check conjunctive query containment.']

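If case matters, remove the w.lower() call (with it, the leading "A" of the third title is dropped as well). If you are extracting the lines by parsing a web page, I suggest extracting the text from the markup before writing it to the file.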
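
The snippet above joins all titles into a single string. If one cleaned string per title is preferred, a minimal sketch along the same lines could look like this. The function name clean_title and its case_sensitive flag are made up for illustration and are not part of the answer.

```python
lines = ['<title>The query complexity of estimating weighted averages.</title>',
         '<title>New bounds for the query complexity of an algorithm that learns DFAs with correction and equivalence queries.</title>',
         '<title>A general procedure to check conjunctive query containment.</title>']
words = {"a", "is", "and", "there", "here"}

# Hypothetical helper: cleans one title at a time.
def clean_title(line, stop_words, case_sensitive=False):
    # Remove the tags by name instead of relying on fixed slice offsets.
    text = line.replace('<title>', '').replace('</title>', '')
    kept = []
    for w in text.split():
        # Compare in lower case unless the caller asks for exact case.
        key = w if case_sensitive else w.lower()
        if key not in stop_words:
            kept.append(w)
    return ' '.join(kept)

cleaned = [clean_title(line, words) for line in lines]
for title in cleaned:
    print(title)
```

Passing case_sensitive=True keeps the leading "A" of the third title, which mirrors dropping the w.lower() call.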

Answer 3

First of all, you should always post what you have tried so far.

Using only built-ins:

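for i in range(len(lines)):        # range(len(x)) already stops at the last index
    for it in range(len(words)):
        # str.replace removes every occurrence, including inside longer words
        lines[i] = lines[i].replace(words[it], '')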

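The code explained line by line:

For each item in the list 'lines', i = the index of the current line
For each item in the list 'words', it = the index of the current word in 'words'
The current item in 'lines' is replaced by itself with the current word from 'words' replaced by ''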
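
One thing worth checking with this approach is that str.replace works on substrings, not on whole words. The sketch below contrasts it with a token-based filter on one of the titles; the variable names by_replace and by_tokens are made up for illustration.

```python
words = ['a', 'is', 'and', 'there', 'here']
line = 'A general procedure to check conjunctive query containment.'

# Plain substring replacement, as in the loop above.
by_replace = line
for word in words:
    by_replace = by_replace.replace(word, '')

# Token-based filtering: drops only tokens that equal a listed word.
by_tokens = ' '.join(w for w in line.split() if w not in words)

print(by_replace)  # every lowercase "a" is gone, even inside "general"
print(by_tokens)   # unchanged here, because "A" is upper case
```

If only whole words should go, comparing tokens (or a regular expression with word boundaries) is the safer choice.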

Answer 4

lines=['<title>The query complexity of estimating weighted averages.</title>', '<title>New bounds for the query complexity of an algorithm that learns DFAs with correction and equivalence queries.</title>', '<title>A general procedure to check conjunctive query containment.</title>']

words = [' a ', ' is ', ' and ', ' there ', ' here ', '<title>', '</title>']

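I add a space before and after each word to be removed, to make sure that only whole words are removed and not parts of longer words. This does not cover cases where the word is followed by a comma or a period, or where the first or last word of the sentence is in the list. Also, this is case-sensitive.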

After that, just do this:

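for i in words:
    # Space-padded words become a single space so their neighbours stay separated; tags are removed outright.
    replacement = ' ' if i.startswith(' ') else ''
    for j in range(len(lines)):
        lines[j] = lines[j].replace(i, replacement)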
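
The caveats mentioned above (punctuation, words at the start or end of a sentence, and case) can be illustrated with a short sketch. The sample strings and the helper names remove_padded and remove_tokens are assumptions for illustration; the token-based variant is a swapped-in alternative, not the method of this answer.

```python
words = ['a', 'is', 'and', 'there', 'here']
samples = ['here we go', 'this and, that', 'stay Here.']  # made-up inputs

padded = [' ' + w + ' ' for w in words]

# Space-padded replacement, as described in this answer.
def remove_padded(text):
    for p in padded:
        text = text.replace(p, ' ')
    return text

# Alternative: compare each token after stripping punctuation and case.
# Note that a removed token takes its attached punctuation with it.
def remove_tokens(text):
    kept = [w for w in text.split()
            if w.strip('.,;:!?').lower() not in words]
    return ' '.join(kept)

padded_results = [remove_padded(s) for s in samples]
token_results = [remove_tokens(s) for s in samples]

print(padded_results)  # all three samples come back unchanged
print(token_results)
```

None of the three samples contains a listed word with a space on both sides, so the padded replacement leaves them untouched.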

Answer 5

Assuming you start with this (slightly fixed):

lines=  ['<title>The query complexity of estimating weighted averages.</title>', '<title>New bounds for the query complexity of an algorithm that learns DFAs with correction and equivalence queries.</title>', '<title>A general procedure to check conjunctive query containment.</title>']

and want to remove specific words/character sequences:

remove_words = ['a', 'is', 'and', 'there', 'here', '<title>', '</title>']

You can do this:

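trimmed_lines = []
for line in lines:
    # The tags are glued to the neighbouring words, so strip them before splitting.
    for tag in ('<title>', '</title>'):
        line = line.replace(tag, '')
    trimmed_lines.append(' '.join([w for w in line.split() if w not in remove_words]))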

Remove specific words from a list
