I have a dataframe made up of the columns node, comp and precedingWord. The node column contains many identical values (sorted alphabetically), comp also contains many identical values but in scrambled order, and precedingWord can hold all kinds of words, though some of those repeat as well.
What I want to do now is create some kind of cross-section/frequency list that shows, per node, the frequency of each component and of each preceding word.
Let's say this is my df:
node     precedingWord  comp
banana   the            lel
banana   a              lel
banana   a              lal
coconut  some           lal
coconut  few            lil
coconut  the            lel
What I expect is a frequency list that shows every unique node, together with the number of times a certain value occurs in the other columns given a matching condition, e.g.
det1 = a
det2 = the
comp1 = lel
comp2 = lil
comp3 = lal
Expected output:
node     det1  det2  unspecified  comp1  comp2  comp3
banana   2     1     0            2      0      1
coconut  0     1     0            1      1      1
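For completeness, the toy frame above can be built like this (a minimal sketch, with the column values copied straight from the table):

import pandas as pd

df = pd.DataFrame({"node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
                   "precedingWord": ["the", "a", "a", "some", "few", "the"],
                   "comp": ["lel", "lel", "lal", "lal", "lil", "lel"]})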
I have already done this for a single variable, but I don't know how to get the comp columns in as well:
det1 = ["a"]
det2 = ["the"]

df.loc[df.precedingWord.isin(det1), "determiner"] = "det1"
df.loc[df.precedingWord.isin(det2), "determiner"] = "det2"
df.loc[~df.precedingWord.isin(det1 + det2), "determiner"] = "unspecified"

# Create crosstab of node and determiner
freqDf = pd.crosstab(df.node, df.determiner)
I got this from an answer here. If someone could also explain what loc does, that would help a lot.
Taking Andy's answer into account, I tried the following. Note that precedingWord has been replaced by gender, which contains only the values neuter, non_neuter and unspecified.
import numpy as np
import pandas as pd

def frequency_list():
    # Define content of gender classes
    neuter = ["het"]
    non_neuter = ["de"]

    # Add `gender` column to df
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[~df.precedingWord.isin(neuter + non_neuter), "gender"] = "unspecified"

    g = df.groupby("node")

    # Create crosstab of the node, and gender and component
    freqDf = pd.concat([g["component"].value_counts().unstack(1),
                        g["gender"].value_counts().unstack(1)])

    # Reset indices, starting from 1, not the default 0!
    # Crosstabs don't come with an index, so we first set the index with
    # `reset_index` and then alter it.
    freqDf.reset_index(inplace=True)
    freqDf.index = np.arange(1, len(freqDf) + 1)

    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
The output is close to what I want, but not quite:
Answer 0 (score: 3)
Update: here it is with crosstab:
In [11]: df1 = pd.crosstab(df['node'], df['precedingWord'])
In [12]: df1
Out[12]:
precedingWord a few some the
node
banana 2 0 0 1
coconut 0 1 1 1
In [13]: df2 = pd.crosstab(df['node'], df['comp'])
This is clearly cleaner (and a more efficient algorithm on large data).
Then glue them together with a concat on axis=1 (i.e. adding more columns rather than adding more rows).
In [14]: pd.concat([df1, df2], axis=1, keys=['precedingWord', 'comp'])
Out[14]:
precedingWord comp
a few some the lal lel lil
node
banana 2 0 0 1 1 2 0
coconut 0 1 1 1 1 1 1
I would probably leave it like this (as a MultiIndex). If you want it flat, just don't pass the keys (though there may be an issue with duplicate words):
In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
a few some the lal lel lil
node
banana 2 0 0 1 1 2 0
coconut 0 1 1 1 1 1 1
Aside: it would be nice if concat didn't require the column names to be passed explicitly (as the keys kwarg) when they already exist...
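If you do keep the keys (and thus the MultiIndex) but want flat, single-level names afterwards, one sketch is to join the two levels yourself, which also sidesteps the duplicate-word problem:

In [16]: df3 = pd.concat([df1, df2], axis=1, keys=['precedingWord', 'comp'])

In [17]: df3.columns = ['_'.join(c) for c in df3.columns]

The columns then read precedingWord_a, precedingWord_few, ..., comp_lal, comp_lel, comp_lil.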
You can use value_counts:
In [21]: g = df.groupby("node")
In [22]: g["comp"].value_counts()
Out[22]:
node     comp
banana   lel    2
         lal    1
coconut  lal    1
         lel    1
         lil    1
dtype: int64
In [23]: g["precedingWord"].value_counts()
Out[23]:
node     precedingWord
banana   a       2
         the     1
coconut  few     1
         some    1
         the     1
dtype: int64
Getting this into a single frame is a little hacky:
In [24]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)])
Out[24]:
a few lal lel lil some the
node
banana NaN NaN 1 2 NaN NaN NaN
coconut NaN NaN 1 1 1 NaN NaN
banana 2 NaN NaN NaN NaN NaN 1
coconut NaN 1 NaN NaN NaN 1 1
In [25]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)]).fillna(0)
Out[25]:
a few lal lel lil some the
node
banana 0 0 1 2 0 0 0
coconut 0 0 1 1 1 0 0
banana 2 0 0 0 0 0 1
coconut 0 1 0 0 0 1 1
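Note that this stacking gives two rows per node. If you want a single row per node instead, concat the two unstacked frames on axis=1 (the same idea as the crosstab version above):

In [26]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)], axis=1).fillna(0)

This produces the same flat, one-row-per-node layout as Out[15].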
You can map the columns to det1, det2, etc. before doing the concat, for example if you have the mapping as a dictionary.
In [31]: res = g["comp"].value_counts().unstack(1)
In [32]: res
Out[32]:
comp lal lel lil
node
banana 1 2 NaN
coconut 1 1 1
In [33]: res.columns = res.columns.map({"lal": "det1", "lel": "det2", "lil": "det3"}.get)
In [34]: res
Out[34]:
det1 det2 det3
node
banana 1 2 NaN
coconut 1 1 1
Or you could use a list comprehension (if you don't have a dict, or want specific labels):
In [41]: res = g["comp"].value_counts().unstack(1)
In [42]: res.columns = ['det%s' % (i + 1) for i, _ in enumerate(res.columns)]
Answer 1 (score: 1)
Your question can be split into at least three:

- How do you build the frequency table? (see the approaches below)
- What is loc doing?
- Performance: pandas provides fast implementations for certain operations, so try the library implementation before resorting to loops (see below)
1. Using plain pandas:
import pandas as pd

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"],
                   "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})

# you will need a dummy variable:
df["counts"] = 1

# you probably need to reset the index
df_counts = df.groupby(["det", "word"]).agg("count").reset_index()
#   det word  counts
# 0   a  cat       2
# 1   a  pet       3
# 2 the  pet       1

# and pivot it
df_counts.pivot(index="word", columns="det", values="counts").fillna(0)
# det   a  the
# word
# cat   2    0
# pet   3    1
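As an aside, a sketch of a shortcut on the same toy frame: pivot_table with aggfunc="size" collapses the dummy column, the groupby and the pivot into a single call:

df.pivot_table(index="word", columns="det", aggfunc="size", fill_value=0)
# det   a  the
# word
# cat   2    0
# pet   3    1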
An example with two columns:
df = pd.DataFrame([['idee', 'het', 'lel', 1],
                   ['idee', 'het', 'lel', 1],
                   ['idee', 'de', 'lal', 1],
                   ['functie', 'de', 'lal', 1],
                   ['functie', 'de', 'lal', 1],
                   ['functie', 'en', 'lil', 1],
                   ['functie', 'de', 'lel', 1],
                   ['functie', 'de', 'lel', 1]],
                  columns=['node', 'precedingWord', 'comp', 'counts'])

df_counts = df.groupby(["node", "precedingWord", "comp"]).agg("count").reset_index()
df_counts
#       node precedingWord comp  counts
# 0  functie            de  lal       2
# 1  functie            de  lel       2
# 2  functie            en  lil       1
# 3     idee            de  lal       1
# 4     idee           het  lel       2
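To get from this long format to one row per node, a sketch using pivot_table (it handles the two column levels at once):

df_counts.pivot_table(index="node", columns=["precedingWord", "comp"], values="counts", fill_value=0)
# one row per node; the columns form a (precedingWord, comp) MultiIndex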
2. Using Counter
from collections import Counter

import pandas as pd

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"],
                   "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})

# count (det, word) pairs; df.values replaces the long-removed df.as_matrix()
acounter = Counter(tuple(x) for x in df.values)
# Counter({('a', 'pet'): 3, ('a', 'cat'): 2, ('the', 'pet'): 1})

df_counts = pd.DataFrame([(det, word, count) for (det, word), count in acounter.items()],
                         columns=["det", "word", "counts"])
#   det word  counts
# 0   a  cat       2
# 1 the  pet       1
# 2   a  pet       3

df_counts.pivot(index="word", columns="det", values="counts").fillna(0)
# det   a  the
# word
# cat   2    0
# pet   3    1
In my case this one was a little faster than pure pandas (52.6 µs vs. 92.9 µs per loop for the grouping; not counting the pivot).
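A related sketch: since a Counter is just a dict keyed by tuples, pandas can also consume it directly through a Series with a MultiIndex, avoiding the manual DataFrame construction:

df_counts = pd.Series(acounter).rename_axis(["det", "word"]).reset_index(name="counts")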
3. As far as I can tell, this is essentially a natural language processing problem. You could try merging all your data into a single string per row and then using CountVectorizer from sklearn with ngram_range=(1, 2). Something like:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"],
                   "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})

# join each row into a single "det word" string
listofpairs = []
for _, row in df.iterrows():
    listofpairs.append(" ".join(row))

countvect = CountVectorizer(ngram_range=(2, 2), min_df=0.0, token_pattern='(?u)\\b\\w+\\b')
sparse_counts = countvect.fit_transform(listofpairs)

print("* input list:\n", listofpairs)
print("* array of counts:\n", sparse_counts.toarray())
print("* vocabulary [order of columns in the sparse array]:\n", countvect.vocabulary_)

counter_keys = [x[1:] for x in sorted(tuple([v] + k.split(" "))
                                      for k, v in countvect.vocabulary_.items())]
counter_values = np.sum(sparse_counts.toarray(), 0)
df_counts = pd.DataFrame([(x[0], x[1], y) for x, y in zip(counter_keys, counter_values)],
                         columns=["det", "word", "counts"])
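For this toy input, df_counts comes out with the same counts as in the Counter section, so it pivots to the wide layout the same way (a sketch):

df_counts.pivot(index="word", columns="det", values="counts").fillna(0)
# det   a  the
# word
# cat   2    0
# pet   3    1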
Two alternatives for combining the two frames:

1. concat

df1 = df1.set_index("word")
df2 = df2.set_index("word")
dfout = pd.concat([df1, df2], axis=1)

2. merge (sketched below)
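A sketch of the merge variant (assuming df1 and df2 still carry word as a regular column rather than as the index):

dfout = df1.merge(df2, on="word", how="outer")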
loc

It indexes rows (with one argument) or row, column (with two arguments). It works with row/column labels or with boolean indexing (boolean for the rows, in your case).
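A minimal sketch of these forms, on the toy frame from section 1:

df.loc[0]                          # one argument: the row labelled 0
df.loc[0, "word"]                  # row, column: the single value "cat"
df.loc[df["det"] == "a", "word"]   # boolean mask for the rows, a label for the column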
If there is only one article per gender, you can use a direct comparison instead of the isin operation, which may speed things up:
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"

versus

indices_neutral = df["precedingWord"] == "het"
df.loc[indices_neutral, "gender"] = "neuter"

or, shorter but less readable:

df.loc[df["precedingWord"] == "het", "gender"] = "neuter"