Question

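I have a very large DataFrame, and I want to generate the unique values from each column. This is just a sample; there are more than 20 columns in total.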

          CRASH_DT        CRASH_MO_NO     CRASH_DAY_NO
          1/1/2013        01              01    
          1/1/2013        01              01
          1/5/2013        03              05

My desired output is something like this:

<variable = "CRASH_DT">
   <code>1/1/2013</code>
   <count>2</count>
   <code>1/5/2013</code>
   <count>1</count>
</variable>
<variable = "CRASH_MO_NO">
   <code>01</code>
   <count>2</count>
   <code>03</code>
   <count>1</count>
</variable>
<variable = "CRASH_DAY_NO">
   <code>01</code>
   <count>2</count>
   <code>05</code>
   <count>1</count>
</variable>

I have been trying to use the .sum() or .unique() functions, as suggested by many other questions on this topic that I have already looked at.

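None of them seem to apply to this problem, and all of them say that to generate unique values from every column you should either use the groupby function or select individual columns. I have a very large number of columns (over 20), so it doesn't really make sense to group them together just by writing out df.unique['col1','col2'...'col20'].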

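I have tried .unique(), .value_counts(), and .count, but I can't figure out how to apply any of them across multiple columns, as opposed to using the groupby function or anything else suggested in the questions mentioned above.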

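My question is: how can I generate a count of the unique values in each column of a truly massive DataFrame, preferably by looping through the columns themselves? (I apologize if this is a duplicate. I have looked through a lot of questions on this topic, and although they seem like they should work for my problem, I can't figure out how to adapt them to my case.)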

Here is my code so far:

import pyodbc
import pandas.io.sql

conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Users\\<filename>.accdb')

sql_crash = "SELECT * FROM CRASH"
df_crash = pandas.io.sql.read_sql(sql_crash, conn)
df_c_head = df_crash.head()
df_c_desc = df_c_head.describe()

for k in df_c_desc:
   df_c_unique = df_c_desc[k].unique()
   print(df_c_unique.value_counts()) #Generates the error "numpy.ndarray object has no attribute .value_counts()"

Answer 1

I would loop over value_counts().items() for each column:

>>> df["CRASH_DAY_NO"].value_counts()
01    2
05    1
dtype: int64
>>> df["CRASH_DAY_NO"].value_counts().items()
<zip object at 0x7fabf49f05c8>
>>> for value, count in df["CRASH_DAY_NO"].value_counts().items():
...     print(value, count)
...     
01 2
05 1

Something like

def vc_xml(df):
    for col in df:
        yield '<variable = "{}">'.format(col)
        for k,v in df[col].value_counts().items():
            yield "   <code>{}</code>".format(k)
            yield "   <count>{}</count>".format(v)
        yield '</variable>'

with open("out.xml", "w") as fp:
    for line in vc_xml(df):
        fp.write(line + "\n")

gives me

<variable = "CRASH_DAY_NO">
   <code>01</code>
   <count>2</count>
   <code>05</code>
   <count>1</count>
</variable>
<variable = "CRASH_DT">
   <code>1/1/2013</code>
   <count>2</count>
   <code>1/5/2013</code>
   <count>1</count>
</variable>
<variable = "CRASH_MO_NO">
   <code>01</code>
   <count>2</count>
   <code>03</code>
   <count>1</count>
</variable>
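
For context, the error in the question comes from calling .value_counts() on the result of .unique(): for a numeric column, Series.unique() returns a NumPy array, while value_counts() is a method of the pandas Series. Below is a minimal sketch of the per-column loop that collects the counts into a plain dictionary instead of writing XML. The sample frame and the name counts_by_column are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Illustrative stand-in for the question's data (not the real CRASH table).
df = pd.DataFrame({
    "CRASH_DT": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "CRASH_DAY_NO": [1, 1, 5],
})

# For a numeric column, unique() hands back a NumPy array,
# which has no value_counts() method.
unique_values = df["CRASH_DAY_NO"].unique()
print(type(unique_values).__name__, hasattr(unique_values, "value_counts"))

# Calling value_counts() on each column (a Series) works for any number of columns.
counts_by_column = {col: df[col].value_counts().to_dict() for col in df.columns}
for col, counts in counts_by_column.items():
    print(col, counts)
```

The same loop over df.columns scales to 20 or more columns without naming any of them explicitly.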

Answer 2

Here is an answer inspired by the answer to this question. I don't know whether it is scalable enough, though.

import pandas as pd

df = pd.DataFrame({'CRASH_DAY_NO': [1, 1, 5, 2, 2],
 'CRASH_DT': ['10/2/2014 5:00:08 PM',
  '5/28/2014 1:29:28 PM',
  '5/28/2014 1:29:28 PM',
  '7/14/2014 5:42:03 PM',
  '6/3/2014 10:33:22 AM'],
 'CRASH_ID': [1486150, 1486152, 1486224, 1486225, 1486226],
 'SEG_PT_LRS_MEAS': [79.940226960000004,
  297.80989999000002,
  140.56460290999999,
  759.43600000000004,
  102.566036],
 'SER_NO': [1, 3, 4, 5, 6]})

df = df.apply(lambda x: x.value_counts(sort=False))
df.index = df.index.astype(str)
# Transforming to XML by hand ...
def func(row):
    xml = ['<variable = "{0}">'.format(row.name)]
    for field in row.index:
        if not pd.isnull(row[field]):
            xml.append('  <code>{0}</code>'.format(field))
            xml.append('  <count>{0}</count>'.format(row[field]))
    xml.append('</variable>')
    return '\n'.join(xml)

print('\n'.join(df.apply(func, axis=0)))

<variable = "CRASH_DAY_NO">
  <code>1</code>
  <count>2.0</count>
  <code>2</code>
  <count>2.0</count>
  <code>5</code>
  <count>1.0</count>
</variable>
<variable = "CRASH_DT">
  <code>5/28/2014 1:29:28 PM</code>
  <count>2.0</count>
  <code>7/14/2014 5:42:03 PM</code>
  <count>1.0</count>
  <code>10/2/2014 5:00:08 PM</code>
  <count>1.0</count>
  <code>6/3/2014 10:33:22 AM</code>
  <count>1.0</count>
</variable>
....
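
The counts above print as 2.0 rather than 2 because applying value_counts across the columns aligns every column on the union of all observed values, and the NaN gaps this creates force a float dtype. A minimal sketch of one way around that, assuming the same hand-written output format: drop the NaN entries per column and cast each count back to int. The helper name column_to_xml and the reduced sample frame are illustrative assumptions.

```python
import pandas as pd

# Reduced, illustrative version of the sample frame used above.
df = pd.DataFrame({
    "CRASH_DAY_NO": [1, 1, 5, 2, 2],
    "SER_NO": [1, 3, 4, 5, 6],
})

# One column of counts per original column, aligned on the union of values.
# Values missing from a column become NaN, so the counts are stored as floats.
counts = df.apply(lambda x: x.value_counts(sort=False))

def column_to_xml(col):
    """Render one column of counts, skipping NaN and casting counts to int."""
    lines = ['<variable = "{0}">'.format(col.name)]
    for value, count in col.dropna().items():
        lines.append('  <code>{0}</code>'.format(value))
        lines.append('  <count>{0}</count>'.format(int(count)))
    lines.append('</variable>')
    return '\n'.join(lines)

xml_text = '\n'.join(column_to_xml(counts[name]) for name in counts.columns)
print(xml_text)
```

Looping over each column and calling value_counts() directly, as in the first answer, avoids building the wide NaN-filled frame in the first place, which may matter for a very large table.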

Pandas: unique values in each column by looping through the columns?

2 answers: