Question

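I have a text file with some lines of text. I need to filter out all lines that start with a lowercase letter and print only the lines that start with an uppercase letter. How can I do this in Python?

I tried this: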

filtercase =('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')

out = []

ins = open("data.txt","r")
for line in ins:
   for k in filtercase:
      if(not(line.startswith(k))):
           out.append(line)

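It still prints lines that start with any lowercase letter other than 'a'. I don't know how to change the code to make it work. Any help is appreciated.

EDITED: I have more lists of stop words like this that I need to apply to the lines, so it is not just a matter of uppercase versus lowercase.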

Answer 1

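Your original code iterates over every letter in filtercase and, for each letter the line does not start with, appends the line to your list. So every line gets appended many times: for a line never to be added to out, it would have to start with 'a' and 'b' and 'c' and every other letter in the filter list at once.

Instead, you need to iterate over filtercase and look for a single k for which line.startswith(k) is true. If the line starts with any entry in filtercase, don't append it; only if the loop gets through the whole list without the line starting with any element should the line be appended.

Python's for-else syntax is very useful for this kind of check against a list of elements: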

out = []

with open('data.txt', 'r') as ins:
    for line in ins:
        for k in filtercase:
            if line.startswith(k): # If line starts with any of the filter words
                break # Else block isn't executed.
        else: # Line doesn't start with any filter word, so append it to out
            out.append(line)
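
As a side note, str.startswith also accepts a tuple of prefixes, so the inner loop can be collapsed into a single call. A minimal sketch, assuming the lines are already in memory; sample_lines and keep_unfiltered are names made up for illustration and are not part of the question:

```python
import string

# Tuple of prefixes to reject; here all lowercase ASCII letters, as in the question.
filtercase = tuple(string.ascii_lowercase)

def keep_unfiltered(lines, prefixes):
    """Return the lines that do not start with any of the given prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return [line for line in lines if not line.startswith(prefixes)]

# Hypothetical sample data standing in for the contents of data.txt.
sample_lines = ["Alpha\n", "beta\n", "Gamma\n", "delta\n"]
out = keep_unfiltered(sample_lines, filtercase)
print(out)
```

Note that the prefixes must be a tuple; passing a list raises a TypeError.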

Answer 2

The following approach should work.

with open('data.txt', 'r') as ins:
    # list() is needed on Python 3, where filter() returns a lazy iterator.
    out = list(filter(lambda line: not any(line.startswith(sw) for sw in filtercase), ins.readlines()))

Answer 3

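This solution uses a regexp and keeps only lines that start with an uppercase letter and do not contain any of the words in stopwords. Note that the stop word check is a substring test, so, for example, a line containing 'messenger' will not be matched if one of the stop words is 'me'.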

import re

out = []
stopwords = ['no', 'please', 'dont']
lower = re.compile('^[a-z]')
upper = re.compile('^[A-Z]')
with open('data.txt') as ifile:
    for line in ifile:
        if (not lower.match(line)
                and not any(word in line for word in stopwords)
                and upper.match(line)):
            out.append(line)
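
If the substring behaviour is not wanted, one option is to match stop words only as whole words. This is a sketch of a variation, not the code above: stopword_re, keep_line and sample_lines are hypothetical names, and the match stays case-sensitive as in the original.

```python
import re

stopwords = ['no', 'please', 'dont']

# \b marks a word boundary, so 'no' matches "no way" but not "nothing".
# re.escape guards against stop words that contain regex metacharacters.
stopword_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b')
upper = re.compile(r'[A-Z]')

def keep_line(line):
    """True if the line starts with an uppercase ASCII letter and has no whole-word stop word."""
    return bool(upper.match(line)) and stopword_re.search(line) is None

# Hypothetical sample data standing in for the contents of data.txt.
sample_lines = [
    "Knowing is nothing\n",  # 'no' appears only inside longer words
    "There is no way\n",     # contains the whole word 'no'
    "lowercase start\n",
    "Yes please\n",
    "All good\n",
]
out = [line for line in sample_lines if keep_line(line)]
print(out)
```

With the substring test, "Knowing is nothing" would be dropped because 'no' occurs inside it; with word boundaries it is kept.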

Answer 4

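This works (written for Python 3; on Python 2 use xrange instead of range):

out = []
yesYes = range(ord('A'), ord('Z') + 1)  # code points of uppercase ASCII letters
noNo = range(ord('a'), ord('z') + 1)    # code points of lowercase ASCII letters
with open("text.txt", "r") as fp:
    for line in fp:
        if len(line) > 0 and ord(line[0]) in yesYes and ord(line[0]) not in noNo:
            out.append(line)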

Or as a one-liner:

out = [line for line in open("text.txt","r") if len(line)>0 and ord(line[0]) in range(ord('A'),ord('Z')+1) and ord(line[0]) not in range(ord('a'),ord('z')+1)]
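
If capitals outside ASCII should count as well, str.isupper on the first character is one possible alternative to the ord range check. This is a different technique from the one above, shown only as a sketch; starts_with_upper and sample_lines are made-up names.

```python
def starts_with_upper(line):
    """True if the first character of the line is an uppercase letter (Unicode-aware)."""
    # Slicing with [:1] yields '' for an empty string, and ''.isupper() is False,
    # so no explicit length check is needed.
    return line[:1].isupper()

# Hypothetical sample data; "\u00c9" is the letter E with an acute accent.
sample_lines = ["Apple\n", "banana\n", "\u00c9clair\n", "1 item\n", "\n"]
out = [line for line in sample_lines if starts_with_upper(line)]
print(out)
```

Lines starting with a digit, whitespace or punctuation are dropped too, because isupper is False for characters that have no case.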

Answer 5

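Checking for lowercase via the ASCII code range of the lowercase letters can be very fast. With that in place, you can put all the stop words in a set (for faster lookup). This gives the following code: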

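lowers = (ord('a'), ord('z'))
stopWords = set(i.lower() for i in "firstWord anotherWord".split())
out = []
with open('data.txt') as infile:
    for line in infile:
        # Compare code points; comparing the character itself to an int fails.
        if lowers[0] <= ord(line[0]) <= lowers[1]:
            continue
        words = line.split(None, 1)
        # Blank lines have no first word, so guard before indexing.
        if words and words[0].lower() in stopWords:
            continue
        out.append(line)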
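
To try this kind of filter without a real data.txt, the loop can be wrapped in a function and fed an in-memory file via io.StringIO. A rough sketch under a few assumptions: filter_lines and fake_file are hypothetical names, the stop word check is case-insensitive on the first word only, and blank lines are dropped (a choice made here, not something the question asks for).

```python
import io

def filter_lines(infile, stop_words):
    """Keep lines that start with neither a lowercase ASCII letter nor a stop word."""
    stop_words = {w.lower() for w in stop_words}
    out = []
    for line in infile:
        words = line.split(None, 1)
        if not words:                      # blank or whitespace-only line
            continue
        if 'a' <= line[0] <= 'z':          # starts with a lowercase ASCII letter
            continue
        if words[0].lower() in stop_words: # first word is a stop word
            continue
        out.append(line)
    return out

# Hypothetical in-memory stand-in for data.txt.
fake_file = io.StringIO(
    "Hello world\n"
    "firstWord is lowercase\n"
    "Firstword is a stop word\n"
    "\n"
    "Another fine line\n"
)
out = filter_lines(fake_file, ["firstWord", "anotherWord"])
print(out)
```

Any iterable of strings works as infile, so the same function can be called with an open file object.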

Filtering lines in a file using stop words in Python
