Extracting codes with regular expressions (irregular regex keys)

Date: 2019-07-01 06:52:24

Tags: python regex python-3.x pandas dataframe

I am extracting codes from a list of strings that come from email titles. The list looks something like this:

text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']

So far, this is what I have tried:

import re

def get_p_number(text):
    rx = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})',
                    re.I)
    res = []
    m = rx.findall(text)
    if len(m) > 0:
        # Normalize each match: drop spaces and upper-case it
        m = [p_number.replace(' ', '').upper() for p_number in m]
        # remove_duplicates is my own helper (not shown here)
        m = remove_duplicates(m)
        res.append(m)
    else:
        res.append('no P Number found')
    return res

My problem is that I cannot extract the codes that follow the PN, PN:, P/N or P/N: marker, especially when the code starts with a letter (e.g. 'M') or has hyphens between its digits (e.g. 26-59-29).

My desired output is:

['M564839', '575-439', '26-59-29', '88864839']

2 Answers:

Answer 0: (score: 1)

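In your pattern, the part [p/n:]\s+ is a character class followed by whitespace: it matches any one of the listed characters followed by 1+ whitespace characters. In the example data, that matches a forward slash or a colon followed by a space.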

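The next part, (?:\w+(?:\s+|$)), matches 1+ word characters followed by either 1+ whitespace characters or the end of the string. It does not allow for a hyphen or whitespace inside the code, so a value such as 575-439 cannot match.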
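To make this concrete, here is a small sketch that runs the pattern from the question over the sample list and collects what it actually captures. The variable names `original_rx` and `captured` are only illustrative.

```python
import re

text_list = ['Industry / Gemany / PN M564839',
             'Industry / France / PN: 575-439',
             'Telecom / Gemany / P/N 26-59-29',
             'Mobile / France / P/N: 88864839']

# The pattern from the question, unchanged.
original_rx = re.compile(r'[p/n:]\s+((?:\w+(?:\s+|$)){1})', re.I)

# findall returns the content of the single capture group for each match.
captured = [original_rx.findall(text) for text in text_list]

for text, found in zip(text_list, captured):
    print(repr(text), '->', found)
# 'Industry / Gemany / PN M564839' -> ['Gemany ', 'PN ']
# 'Industry / France / PN: 575-439' -> ['France ']
# 'Telecom / Gemany / P/N 26-59-29' -> ['Gemany ']
# 'Mobile / France / P/N: 88864839' -> ['France ', '88864839']
```

The "/ " separator satisfies the character class, so the country name is captured, while the hyphenated codes are never captured because \w+ stops at the first hyphen.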

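One option is to match PN with an optional / and an optional : instead of using the character class [p/n:], and to capture your value in a capturing group: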

/ P/?N:? ([\w-]+)


For example:

import re
text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']
regex = r"/ P/?N:? ([\w-]+)"
res = []
for text in text_list: 
    matches = re.search(regex, text)
    if matches:
        res.append(matches.group(1))

print(res)

Result:

['M564839', '575-439', '26-59-29', '88864839']
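If you want to keep the shape of the function from the question (per-string lookup, upper-casing, de-duplication, and a fallback message), the pattern above could be wrapped roughly as follows. This is only a sketch: `get_p_numbers` and `P_NUMBER_RX` are hypothetical names, `dict.fromkeys` stands in for the `remove_duplicates` helper that the question does not show, and the return shape (a flat list) is an assumption.

```python
import re

# Pattern from this answer: "/ ", then PN or P/N, an optional colon,
# one space, and the code captured as word characters or hyphens.
P_NUMBER_RX = re.compile(r"/ P/?N:? ([\w-]+)")

def get_p_numbers(text):
    """Return the unique codes found in text, upper-cased, in order of appearance."""
    codes = [code.upper() for code in P_NUMBER_RX.findall(text)]
    # dict.fromkeys keeps the first occurrence of each code and preserves order.
    unique = list(dict.fromkeys(codes))
    return unique if unique else ['no P Number found']

print(get_p_numbers('Telecom / Gemany / P/N 26-59-29'))  # ['26-59-29']
print(get_p_numbers('no marker in this title'))          # ['no P Number found']
```

Note that this pattern is case-sensitive, unlike the re.I pattern in the question; pass re.I to re.compile if lower-case markers such as "pn" can occur.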

Answer 1: (score: 1)

The simple pattern M?[-\d]+ should work for you. Here is a demo:

import re

text_list = ['Industry / Gemany / PN M564839', 'Industry / France / PN: 575-439', 'Telecom / Gemany / P/N 26-59-29', 'Mobile / France / P/N: 88864839']

res = []
for elem in text_list:
    for code in re.findall(r'M?[-\d]+', elem):
        res.append(code)

print(res)
  

Output:

['M564839', '575-439', '26-59-29', '88864839']
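One caveat worth checking against your real data: M?[-\d]+ is not tied to the PN marker, so it only keeps a leading letter when that letter is M, and it also picks up digits elsewhere in the title. The sketch below uses two made-up titles (they are assumptions, not taken from the question) and compares this pattern with the marker-anchored pattern from the other answer.

```python
import re

# Hypothetical titles, invented for illustration only.
samples = ['Telecom / Spain / PN K12345',
           'Industry 4 / Gemany / PN M564839']

# Loose pattern from this answer: optional 'M', then digits and hyphens.
loose = [re.findall(r'M?[-\d]+', s) for s in samples]

# Anchored pattern from the other answer: requires the PN or P/N marker.
anchored = [re.search(r"/ P/?N:? ([\w-]+)", s).group(1) for s in samples]

print(loose)     # [['12345'], ['4', 'M564839']]
print(anchored)  # ['K12345', 'M564839']
```

If every code in your data is either purely numeric or starts with M, and no other digits appear in the titles, the simple pattern is sufficient.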