Pandas: Groupby based on matching substrings in a pandas column

Time: 2019-03-05 11:00:36

Tags: python pandas dataframe group-by

I have a list containing elements like:

emails= ['xyz.com', 'abc.com','def.com']

Now, I have a DataFrame that looks like:

df:

UserID    Email_Address
U001      u001@abc.com
U002      u002@xyz.com
U003      u003@xyz.com
U004      u004@abc.com
U004      u005@def.com
U006      u006@def.com
U007      u007@def.com

I want to group the rows by substring and count them, where the substrings are the elements of the list.

So the output should look like this:

abc.com     2
def.com     3
xyz.com     2

My current code:

for domain in list1:
    count = df.groupby( [df.Email_Address.str.find(domain)]).sum()

3 Answers:

Answer 0 (score: 2)

Use Series.str.extract to pull the values from the list out of each address, then aggregate with GroupBy.size.
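
A minimal sketch of that approach, assuming the `emails` list and `df` from the question. The names `pattern`, `domains`, and `counts` are illustrative and not taken from the original answer, and `re.escape` is an added precaution so the dots in the domains are matched literally.

```python
import re

import pandas as pd

emails = ['xyz.com', 'abc.com', 'def.com']
df = pd.DataFrame({
    'UserID': ['U001', 'U002', 'U003', 'U004', 'U004', 'U006', 'U007'],
    'Email_Address': ['u001@abc.com', 'u002@xyz.com', 'u003@xyz.com',
                      'u004@abc.com', 'u005@def.com', 'u006@def.com',
                      'u007@def.com'],
})

# Build one alternation pattern with a single capture group,
# e.g. (xyz\.com|abc\.com|def\.com)
pattern = '(' + '|'.join(re.escape(e) for e in emails) + ')'

# expand=False returns a Series holding the matched domain
# (NaN where no domain from the list is found)
domains = df['Email_Address'].str.extract(pattern, expand=False)

# Group the rows by the extracted domain and count the rows per group
counts = df.groupby(domains).size()
print(counts)
```

On the sample data this gives 2 for abc.com, 3 for def.com, and 2 for xyz.com, which matches the desired output in the question.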


Answer 1 (score: 0)

To find how many times a specific value occurs in a DataFrame column, you can use:

len(df[df['Email_Address'] == your_value])

So I think you are looking for something like:

for domain in list1:
    count = len(df[df['Email_Address'] == domain])  # Save this value wherever you want
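
Note that `==` only matches cells equal to the whole string, so comparing full addresses against bare domains would count nothing on the sample data. Below is a hedged sketch of a substring-aware variant. It assumes the domain always sits at the end of the address and uses `Series.str.endswith`; the `counts` dict is an illustrative name, not part of the original answer.

```python
import pandas as pd

emails = ['xyz.com', 'abc.com', 'def.com']
df = pd.DataFrame({
    'Email_Address': ['u001@abc.com', 'u002@xyz.com', 'u003@xyz.com',
                      'u004@abc.com', 'u005@def.com', 'u006@def.com',
                      'u007@def.com'],
})

counts = {}
for domain in emails:
    # Boolean mask: True where the address ends with '@<domain>'
    mask = df['Email_Address'].str.endswith('@' + domain)
    # Summing a boolean mask counts the True entries
    counts[domain] = int(mask.sum())

print(counts)
```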

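Answer 2 (score: 0)

def mapf(x):
    # Take everything after the '@' and keep it only if it is a known domain
    domain = x[x.find('@')+1:]
    if domain in emails:
        return domain
    # Falls through and returns None for domains not in the list

df['Email_Address'].apply(mapf).value_counts()

The function returns None when the address does not match any domain in the list, so only the matching emails are counted.

Output looks like: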

def.com    3
abc.com    2
xyz.com    2
Name: Email_Address, dtype: int64
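
On why unmatched addresses disappear from the count: `value_counts` drops missing values by default (`dropna=True`), and the `None` returned by `mapf` is treated as missing. A small sketch with one made-up non-matching address (`u008@other.org` is hypothetical, not from the question) shows the difference:

```python
import pandas as pd

emails = ['xyz.com', 'abc.com', 'def.com']


def mapf(x):
    # Keep the part after '@' only when it is one of the known domains
    domain = x[x.find('@') + 1:]
    if domain in emails:
        return domain


# Two matching addresses plus one hypothetical address outside the list
addresses = pd.Series(['u001@abc.com', 'u002@xyz.com', 'u008@other.org'])
mapped = addresses.apply(mapf)

# Default behaviour: rows mapped to None are left out of the tally
matched_only = mapped.value_counts()

# dropna=False keeps the missing rows as their own entry
with_missing = mapped.value_counts(dropna=False)

print(matched_only)
print(with_missing)
```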