I have a list with elements like these:
emails = ['xyz.com', 'abc.com', 'def.com']
I also have a DataFrame that looks like this:
df:
UserID Email_Address
U001 u001@abc.com
U002 u002@xyz.com
U003 u003@xyz.com
U004 u004@abc.com
U004 u005@def.com
U006 u006@def.com
U007 u007@def.com
I want to do a groupby count based on substrings, where the substrings are the elements of the list.
So the output should look like this:
abc.com 2
def.com 3
xyz.com 2
My current code:
for domain in emails:
    count = df.groupby([df.Email_Address.str.find(domain)]).sum()
Answer 0 (score: 2)
Use Series.str.extract to pull the values from the list out of each address, then aggregate with GroupBy.size:
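A minimal sketch of this approach, assuming the sample data from the question. The names `pattern`, `domains`, and `counts` are illustrative and not part of the original answer.

```python
import re
import pandas as pd

emails = ['xyz.com', 'abc.com', 'def.com']
df = pd.DataFrame({
    'UserID': ['U001', 'U002', 'U003', 'U004', 'U004', 'U006', 'U007'],
    'Email_Address': ['u001@abc.com', 'u002@xyz.com', 'u003@xyz.com',
                      'u004@abc.com', 'u005@def.com', 'u006@def.com',
                      'u007@def.com'],
})

# Build one alternation pattern; re.escape keeps the dots literal.
pattern = '(' + '|'.join(re.escape(d) for d in emails) + ')'

# expand=False returns a Series with the first captured domain (NaN if none matched).
domains = df['Email_Address'].str.extract(pattern, expand=False)

# Group the rows by the extracted domain and count the rows in each group.
# Rows whose domain is NaN are dropped by groupby by default.
counts = df.groupby(domains).size()
print(counts)
```

Escaping the list entries matters here because an unescaped `.` in a regular expression matches any character.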
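Answer 1 (score: 0)
To count how often a specific value occurs in a DataFrame column, you can use:
len(df[df['Email_Address'] == your_value])
So I think you are looking for something like this:
for domain in emails:
    count = len(df[df['Email_Address'] == domain])  # store this value wherever you need it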
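Note that `==` compares whole strings, so with the sample data no address equals a bare domain and every count would be 0. A minimal sketch of a substring-based variant of the same loop, assuming the question's data; the `counts` dictionary and the `'@' + domain` suffix check are illustrative choices, not part of the original answer.

```python
import pandas as pd

emails = ['xyz.com', 'abc.com', 'def.com']
df = pd.DataFrame({
    'UserID': ['U001', 'U002', 'U003', 'U004', 'U004', 'U006', 'U007'],
    'Email_Address': ['u001@abc.com', 'u002@xyz.com', 'u003@xyz.com',
                      'u004@abc.com', 'u005@def.com', 'u006@def.com',
                      'u007@def.com'],
})

counts = {}
for domain in emails:
    # Match the full domain part only, so 'abc.com' does not also
    # count an address such as 'user@xabc.com'.
    mask = df['Email_Address'].str.endswith('@' + domain)
    # Summing a boolean mask counts the True entries.
    counts[domain] = int(mask.sum())

print(counts)
```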
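Answer 2 (score: 0)
def mapf(x):
    # Keep the part after '@' only if it is one of the listed domains.
    if x[x.find('@')+1:] in emails:
        return x[x.find('@')+1:]
df['Email_Address'].apply(mapf).value_counts()
When an address does not match any domain in the list, the function returns None, and value_counts skips missing values by default. So only the matching addresses are counted.
The output looks something like this: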
def.com 3
abc.com 2
xyz.com 2
Name: Email_Address, dtype: int64
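The same idea can be sketched without `apply`, assuming the question's data. This variant takes the text after the last `@` (the function above uses the first one), and the names `domain` and `counts` are illustrative.

```python
import pandas as pd

emails = ['xyz.com', 'abc.com', 'def.com']
df = pd.DataFrame({
    'UserID': ['U001', 'U002', 'U003', 'U004', 'U004', 'U006', 'U007'],
    'Email_Address': ['u001@abc.com', 'u002@xyz.com', 'u003@xyz.com',
                      'u004@abc.com', 'u005@def.com', 'u006@def.com',
                      'u007@def.com'],
})

# Take everything after the last '@' as the domain.
domain = df['Email_Address'].str.rsplit('@', n=1).str[-1]

# Replace domains that are not in the list with NaN;
# value_counts leaves NaN out by default.
counts = domain.where(domain.isin(emails)).value_counts()
print(counts)
```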