使用正则表达式识别熊猫列中的模式并清除数据

时间:2019-04-21 01:59:02

标签: regex pandas

我有一个包含公司创新数据的数据集,通过使用一些正则表达式,我想检索许可证数据

company licences/patents
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt
2       VV3346E, SS345
3       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz

所需的输出:

company licences/patents                                    licence
1       UX226, licence-pp-zz, licence-zz-pp, licence-xx-tt  licence-pp-zz, licence-zz-pp, licence-xx-tt 
2       VV3346E, SS345
3       licence-dd-zz                                       licence-dd-zz
4       UT223, licence, ss
5       XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
6       xc, zz
7       licence-xb-xz                                       licence-xb-xz

5 个答案:

答案 0 :(得分:3)

您可以尝试:

df['licence'] = df['licences/patents'].str.extractall('(licence-\w{2}-\w{2})')\
  .unstack().apply(lambda x: ', '.join(x.dropna()), axis=1)

输出:

   company                                   licences/patents                                      licence
0        1  UX226, licence-pp-zz, licence-zz-pp, licence-x...  licence-pp-zz, licence-zz-pp, licence-xx-tt
1        2                                     VV3346E, SS345                                          NaN
2        3                                      licence-dd-zz                                licence-dd-zz
3        4                                 UT223, licence, ss                                          NaN
4        5                XBTYU, licence-tt-kk, licence-ss-tt                 licence-tt-kk, licence-ss-tt
5        6                                             xc, zz                                          NaN
6        7                                      licence-xb-xz                                licence-xb-xz

答案 1 :(得分:2)

另一种方法,使用Series.str.findallSeries.str.join

$this->db->where('amount', 'discount', false);

[出]

using ExcelDataReader;
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;

namespace Driven
{
    public class ExcelLib
    {
        public static DataTable ExcelToDataTable(string fileName)
        {
            //open file and returns as Stream
            FileStream stream = File.Open(fileName, FileMode.Open, FileAccess.Read);
            {
                using (var reader = ExcelReaderFactory.CreateReader(stream))
                {
                    var result = reader.AsDataSet(new ExcelDataSetConfiguration()
                    {
                        ConfigureDataTable = (data) => new ExcelDataTableConfiguration()
                        {
                            UseHeaderRow = true
                        }
                    });

                    //Get all the Tables
                    DataTableCollection table = result.Tables;
                    //Store it in DataTable
                    DataTable resultTable = table["Sheet1"];

                    //return
                    return resultTable;
                }
            }
        }

            static List<Datacollection> dataCol = new List<Datacollection>();

        public static void PopulateInCollection(string fileName)
        {
            DataTable table = ExcelToDataTable(fileName);

            //Iterate through the rows and columns of the Table
            for (int row = 1; row <= table.Rows.Count - 1; row++)
            {
                for (int col = 0; col <= table.Columns.Count; col++)
                {
                    Datacollection dtTable = new Datacollection()
                    {
                        rowNumber = row,
                        colName = table.Columns[col].ColumnName,
                        colValue = table.Rows[row - 1][col].ToString()
                    };
                    //Add all the details for each row
                    dataCol.Add(dtTable);
                }
            }
        }

        public static string ReadData(int rowNumber, string columnName)
        {
            try
            {
                //Retriving Data using LINQ to reduce much of iterations
                string data = (from colData in dataCol
                               where colData.colName == columnName && colData.rowNumber == rowNumber
                               select colData.colValue).SingleOrDefault();

                //var datas = dataCol.Where(x => x.colName == columnName && x.rowNumber == rowNumber).SingleOrDefault().colValue;
                return data.ToString();
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
                return null;
            }
        }
        public class Datacollection
        {
            public int rowNumber { get; set; }
            public string colName { get; set; }
            public string colValue { get; set; }
        }
    }
}

答案 2 :(得分:1)

This RegEx可能会帮助您创建一个组$1,其中所需的输出可能是:

(licence-[a-z]{2}-[a-z]{2})

RegEx

答案 3 :(得分:1)

function formikCustomValidation({ field, form }: FieldProps, ownerCmp: Component<any, any>) { return { ...field, onBlur: e => { if (field.value !== form.initialValues[field.name]) { ownerCmp.setState({ changedFields: { ...ownerCmp.state.changedFields, [field.name]: true } }); } field.onBlur(e); setTimeout(form.validateForm); }, onChange: e => { field.onChange(e); if (ownerCmp.state.changedFields[field.name]) { setTimeout(form.validateForm); } } }; }

它将创建新的列许可证,并从原始列中剥离许可证。

答案 4 :(得分:0)

尝试以下代码:

df['licences/patents'].str.findall('(licence-[\w\-]+)').apply(", ".join)