Question

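I'm trying to write a function that applies regular expressions conditionally. I want to extract attribute information about products, and I have generalized a few different patterns that help me pull out the data.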

My working code so far is:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
import re
import string  # needed for string.punctuation below


# Compile the patterns once, outside the loop.
m0 = re.compile(r'[a-z-A-Z-0-9--]+\s\([a-z-A-Z]+,\s[-0-9-]+\)')
m1 = re.compile(r'[a-z-A-Z-0-9--]+\s\([0-9-]+,\s[a-z-A-Z-]+\)')

filename = '/PATH/TO/dataFILE'
with open(filename) as f:
    for line in f:
        if m0.findall(line):
            matching_words = m0.findall(line)
            for word in matching_words:
                # Split on whitespace and strip punctuation from each token.
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print('Product: ' + cleanwords[1] + '\n' + 'Attribute: ' + cleanwords[0])

So far the code works and prints the correct output. The problem starts when I add the `elif`:

        elif m1.findall(line):
            matching_words = m1.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print('Product: ' + cleanwords[2] + '\n' + 'Attribute: ' + cleanwords[0])

An example of the data file I'm working with (I'm providing parallel dummy data):

The cellphone DeluxeModel (Samsung, 2007) is the best on the market. It is possible that the LightModel (Apple, 2010) is also relevant. It has been said that NewModel (1997,Blackberry) could also be useful - but I don't know.

The desired result is:

Company: Samsung Product: DeluxeModel
Company: Apple Product: LightModel
Company: Blackberry Product: NewModel

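I have consulted HERE and HERE on cascading and grouping approaches for what I want to achieve, but I can't work out why my implementation is incorrect. Is there a way to adjust my code so that it produces the desired result?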

Updated code

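I have been trying different modifications, and I am now able to get output. However, each time I add a new condition the results become more restricted. Is there a way to optimize this?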

import re
import string

# Compile the patterns once, outside the loop.
m0 = re.compile(r'[a-z-A-Z-0-9--]+\s\([a-z-A-Z-0-9--]+,\s[a-z-A-Z-0-9--]+\) | [a-z-A-Z-0-9--]+\s\([A-Z][a-z-]+\)')
m1 = re.compile(r'[a-zA-Z0-9-]+\s\(>[0-9]+.[0-9]\%,\s[a-zA-Z0-9-]+\)')
m2 = re.compile(r'[a-zA-Z0-9-]+\s\([a-zA-Z0-9-]+\),\s>[0-9]+.[0-9]\%')

filename = '/PATH/TO/DATA'
with open(filename) as f:
    for line in f:
        if m0.findall(line):
            matching_words = m0.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print('Company: ' + cleanwords[1] + '\n' + 'Product: ' + cleanwords[0])
        if m1.findall(line):
            matching_words = m1.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print('Company: ' + cleanwords[2] + '\n' + 'Product: ' + cleanwords[0])
        if m2.findall(line):
            matching_words = m2.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print('Company: ' + cleanwords[1] + '\n' + 'Product: ' + cleanwords[0])

Answer 1

With a single regular expression, the if...elif is unnecessary.

import re

line = 'The cellphone DeluxeModel (Samsung, 2007) is the best on the market. It is possible that the LightModel (Apple, 2010) is also relevant. It has been said that NewModel (1997,Blackberry) could also be useful - but I don\'t know.'
# Group 1: product name; group 2: optional leading "year,"; group 3: first word after it (the company).
t = re.compile(r'(\w+)\s\((\d+,)?\s?(\w+)')
q = t.findall(line)
for match in q:
    print('Company: {} Product: {}'.format(match[2], match[0]))

Output:

Company: Samsung Product: DeluxeModel
Company: Apple Product: LightModel
Company: Blackberry Product: NewModel
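
As a side note on why the original `elif` stays silent: the `if`/`elif` is decided once per line, so if the sample text sits on a single line (an assumption about the data file), `m0` already finds something and the `m1` branch is skipped for the whole line. In addition, `m1` requires whitespace after the comma, which `(1997,Blackberry)` does not have. The sketch below is one possible variant of the single-pattern idea that spells out both orderings with named groups and decides per match rather than per line. The names `PRODUCT_RE`, `extract_products`, `company_first` and `company_last` are hypothetical, and the four-digit year is an assumption based on the sample data.

```python
import re

# Hypothetical pattern for illustration; assumes a four-digit year and a
# company name made of letters and hyphens, as in the sample data.
PRODUCT_RE = re.compile(
    r"""
    (?P<product>[\w-]+)\s*\(\s*                     # product name and opening parenthesis
    (?:
        (?P<company_first>[A-Za-z-]+)\s*,\s*\d{4}   # company before year
      |
        \d{4}\s*,\s*(?P<company_last>[A-Za-z-]+)    # year before company
    )
    \s*\)                                           # closing parenthesis
    """,
    re.VERBOSE,
)


def extract_products(text):
    """Return a list of (company, product) tuples found in text."""
    pairs = []
    for m in PRODUCT_RE.finditer(text):
        # Exactly one of the two company groups is set for any given match.
        company = m.group("company_first") or m.group("company_last")
        pairs.append((company, m.group("product")))
    return pairs


line = ("The cellphone DeluxeModel (Samsung, 2007) is the best on the market. "
        "It is possible that the LightModel (Apple, 2010) is also relevant. "
        "It has been said that NewModel (1997,Blackberry) could also be useful "
        "- but I don't know.")

for company, product in extract_products(line):
    print('Company: {} Product: {}'.format(company, product))
```

Because every match is classified on its own, adding another ordering means adding one more alternative to the pattern instead of another `if` branch.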
