Question

I have a CSV file in key-value (ID, tags) format with the following contents:

1,art
2,fine art;masterpiece
3,modern art
4,artifact;artefact
5,article

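My goal is to use Python to return only IDs 1, 2 and 3, whose tags contain "art" as an explicit word. When I use the find() function (myfile.find("art")), it finds IDs 1-5.

My first idea was to look at the characters surrounding the string "art" in the tags. Perhaps I could use the isalpha() function to check whether the characters before and after "art" are letters rather than punctuation. However, this is one of the first Python scripts I have written, so there may well be a regex that does this in one line that I don't know about.

Any help is greatly appreciated.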

Answer 1

You can use the regex \b assertion:

>>> import re
>>> pairs = ((1, "art"), (2, "fine art;masterpiece"), (3, "modern art"),
             (4, "artifact;artefact"), (5, "article"))
>>> [id for id, tag in pairs if re.search(r"\bart\b", tag)]
[1, 2, 3]

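As the documentation states, \b matches the boundary between a word character and a non-word character (or vice versa), or between a word character and the beginning/end of the string.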
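To see how \b behaves around punctuation and spaces, here is a minimal sketch that wraps the same pattern in a small function. The name `has_word` is hypothetical (it is not part of `re`), and `re.escape` is added on the assumption that the search word might contain regex metacharacters:

```python
import re

def has_word(word, text):
    """Return True if `word` occurs in `text` as a whole word."""
    # re.escape keeps characters such as "." or "+" in `word` literal.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text) is not None

print(has_word("art", "fine art;masterpiece"))  # True: space before, ";" after
print(has_word("art", "artifact;artefact"))     # False: "art" is followed by a letter
print(has_word("art", "state-of-the-art"))      # True: "-" before, end of string after
```

Note that \b only works as intended when the search word itself starts and ends with a word character (letter, digit or underscore).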

Answer 2

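You need to build a lookup index that implements your indexing logic. Read your file, parse each CSV line, and update a dict-based lookup index. Each key in the lookup index should be normalized, e.g. lowercased, and point to a list of IDs.

Here is a small snippet: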

from io import StringIO

file_content = StringIO('''1,art
2,fine art;masterpiece
3,modern art
4,artifact;artefact
5,article''')

_index = {}

for line in file_content:
    # parse CSV
    (_id, _, tags) = line.strip().partition(',')

    # parse tags
    tags = tags.split(';')

    tokens = set()

    # tokenize tags
    for tag in tags:
        for token in tag.split(' '):
            # add normalized token to tokens set
            tokens.add(token.lower())

    # update index
    for token in tokens:
        _index.setdefault(token, []).append(_id)

# lookup tag arg in your index
print(_index['art'])
# prints ['1', '2', '3']
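The same index can also be sketched with the standard `csv` module and `collections.defaultdict`. This is only one possible variant, under the assumption that every row has exactly two fields; the function name `build_index` and the file name in the comment are hypothetical:

```python
import csv
import io
from collections import defaultdict

def build_index(rows):
    """Map each lowercased tag word to the list of IDs whose tags contain it."""
    index = defaultdict(list)
    for row_id, tags in rows:
        # A set avoids appending the same ID twice when a word repeats in a row.
        words = {word.lower() for tag in tags.split(";") for word in tag.split()}
        for word in words:
            index[word].append(row_id)
    return index

# In-memory stand-in for open("tags.csv", newline=""); the file name is hypothetical.
sample = io.StringIO(
    "1,art\n2,fine art;masterpiece\n3,modern art\n4,artifact;artefact\n5,article\n"
)
index = build_index(csv.reader(sample))
print(index["art"])  # ['1', '2', '3']
```

Building the index costs one pass over the file, after which each lookup is a single dict access. Use `index.get(word, [])` for words that may be absent, since indexing a `defaultdict` with a missing key creates an empty entry.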

Answer 3

You can use this code:

lines = ['art', 'fine art;masterpiece', 'modern art', 'artifact;artefact', 'article']
for line in lines:
    lis = [part.split(' ') for part in line.split(';')]  # Split the values.
    lis = [item for sublist in lis for item in sublist]  # Flatten the list.
    print('art' in lis)  # Check if 'art' is contained.

This lets you identify the lines that contain art (but not artifact). Or like this:

lines = ['art', 'fine art;masterpiece', 'modern art', 'artifact;artefact', 'article']
# IDs start at 1, so number the lines from 1.
for idx, line in enumerate(lines, start=1):
    lis = [part.split(' ') for part in line.split(';')]  # Split the values.
    lis = [item for sublist in lis for item in sublist]  # Flatten the list.
    if 'art' in lis:  # Check if 'art' is contained.
        print(idx)
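The split-and-flatten steps above can also be collapsed into a single call. The sketch below assumes that tags are separated only by semicolons and spaces, and uses `re.split` in place of the two nested splits; the variable name `matching_ids` is made up for illustration:

```python
import re

lines = ['art', 'fine art;masterpiece', 'modern art', 'artifact;artefact', 'article']

# re.split on ";" or " " yields the individual words of each line directly.
matching_ids = [
    idx
    for idx, line in enumerate(lines, start=1)
    if 'art' in re.split(r'[; ]+', line)
]
print(matching_ids)  # [1, 2, 3]
```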

Answer 4

Short and sweet: use \b (word boundaries)

import re

a = ['1,art', '2,fine art;masterpiece', '3,modern art', '4,artifact;artefact', '5,article']
for data in a:
    # re.search returns None when "art" does not appear as a whole word
    if re.search(r'\bart\b', data):
        ids = re.findall(r'\d+', data)
        print(ids)

Find an explicit word in a list using Python
