Selecting pandas DataFrame records based on the contents of a Series field

Time: 2013-06-30 13:39:20

Tags: pandas

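======================== UPDATE #2 ========================

What a day. I am slowly making progress. However, while pandas is very fast and powerful, it has a steep learning curve and there are not many good examples (at least for what I am trying to do).

The latest issue is this particular line: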

 catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]

It works in the IPython Notebook, but not on Ubuntu with Python 2.7.

Here is the error on Ubuntu:

    Traceback (most recent call last):
      File "scikit2.py", line 27, in <module>
        catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]
      File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/series.py", line 2408, in map
        mapped = map_f(values, arg)
      File "inference.pyx", line 861, in pandas.lib.map_infer (pandas/lib.c:41822)
      File "scikit2.py", line 27, in <lambda>
        catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]
    TypeError: 'bool' object is not iterable

And here is the working code, with its results, in the IPython Notebook:

targetcat = 'Financial Services Industries'
#targetcat = 'Payroll & Employment Tax'
criterion = foo[foo['dtu_topic_split'].map(lambda x: any(targetcat in x))]
print criterion[['dtu_docid','dtu_topic_split']][:10]



     dtu_docid                                    dtu_topic_split
9    2010-0185                    [Financial Services Industries]
17   2010-0152  [Financial Services Industries, International ...
46   2012-1421  [Financial Services Industries, Payroll & Empl...
49   2012-1413  [Financial Services Industries, Payroll & Empl...
66   2012-1370  [Energy Taxation, Financial Services Industrie...
94   2009-1786                    [Financial Services Industries]
144  2012-1170       [Financial Services Industries, Real Estate]
163  2012-1101       [Financial Services Industries, Real Estate]
170  2009-1386                    [Financial Services Industries]
249  2012-0754  [Expatriate Taxation, Financial Services Indus...

Here is the Python version in the IPython Notebook:

print sys.version
2.7.4 (default, Apr 19 2013, 18:28:01) 
[GCC 4.7.3]

And from Ubuntu:

>>> import sys
>>> print sys.version
2.7.4 (default, Apr 19 2013, 18:28:01) 
[GCC 4.7.3]
>>> 

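Help needed. If I used traditional processing, I am confident I could get this data set up and munged. I am still trying with pandas, but it is tough sledding, and the saddest part is that I am not even sure why the things I do get working actually work. Errors like this add to the frustration.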

======================== UPDATE #1 ========================

Using the information in the first answer (thanks, tshauck), I found a way to solve this:

targetcat = 'International Taxation'
criterion = foo[foo['dtu_topic_split'].map(lambda x: any(targetcat in x))]

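This produces the rows in which targetcat appears in the dataframe.dtu_topic_split Series. Given that I am new to pandas, is this the best way to handle it? I plan to build separate training modules for each of 30-50 categories. I am not sure whether I should iterate over the roughly 100K records in a more traditional Python style or use a pandas technique. Again, any alternatives or suggestions would be greatly appreciated.


I am new to pandas and working hard to learn how to leverage its power. Yesterday I posted a strategy that solved this by building a separate DataFrame. After reading more, I am not sure that is the most efficient approach. I have tried several techniques for selecting specific rows from a DataFrame based on the presence of a particular value in a Series field of that DataFrame. Below are a sample of the data and my attempts.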

print foo[['dtu_docid','dtu_topic_split']]

/home/davidwaldrop/Dropbox/Miscelaneous/E&Y M&C Project/scikit training
   dtu_docid                                    dtu_topic_split
0  2012-1553          [Energy Taxation, State & Local Taxation]
1  2012-1552         [Legislation & Policy, Financial Services]
2  2010-0227            [Quantitative Economics and Statistics]
3  2010-0215                     [International Taxation, Asia]
4  2012-1529  [Ernst & Young Newsletters, This Week in Tax R...

This is what I am currently trying, to no avail:

targetcat = ['International Taxation']

criterion = foo['dtu_topic_split'].map(lambda x: x == targetcat)

print foo[criterion]

Empty DataFrame
Columns: [id, dtu_docid, dtu_topic, dtu_content, dtu_topic_split]
Index: []

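What I want is a DataFrame containing the records that have 'International Taxation' in the list stored in the dtu_topic_split field. In the example above, that is the record at index 3 of foo, whose dtu_topic_split value is [International Taxation, Asia].

As I mentioned, I really want to learn pandas and think it is very powerful. As a newbie, it is hard to find not only a way to do what I want but also a reasonably optimal way. My intuition tells me this might best be done with an index, but I have not gotten to that functionality yet. Any insights are most welcome.

2 answers:

Answer 0 (score: 2)

Hopefully I understand your particular use case well enough to provide a decent answer.

Given some data: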

data = """
dtu_docid|dtu_topic_split
9|2010-0185|['Financial Services Industries']
17|2010-0152|['Financial Services Industries', 'International']
46|2012-1421|['Financial Services Industries', 'Payroll & Employment Tax']
49|2012-1413|['Financial Services Industries', 'Payroll & Employment Tax']
66|2012-1370|['Energy Taxation', 'Financial Services Industries']
94|2009-1786|['Financial Services Industries']
144|2012-1170|['Financial Services Industries', 'Real Estate']
163|2012-1101|['Financial Services Industries', 'Real Estate']
170|2009-1386|['Financial Services Industries']
249|2012-0754|['Expatriate Taxation', 'Financial Services Industries']
""".split('\n')

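Considering this from the question:

"What I want is a DataFrame containing the records that have 'International Taxation' stored in the field dtu_topic_split"

You can put it into a DataFrame: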

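import ast
import pandas as pd

# drop the empty strings left over from splitting the triple-quoted block
rows = [row for row in data if len(row) > 0]

cleaned = []
for i, row in enumerate(rows):
    row = row.split('|')
    if i == 0:
        headers = row
    else:
        row = row[1:] # get rid of the index
        # parse the list literal; ast.literal_eval is a safer stand-in for eval
        row[1] = ast.literal_eval(row[1])
        cleaned.append(row)

df = pd.DataFrame(cleaned, columns=headers)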

Which looks like this:

print df
   dtu_docid                                    dtu_topic_split
0  2010-0185                    [Financial Services Industries]
1  2010-0152     [Financial Services Industries, International]
2  2012-1421  [Financial Services Industries, Payroll & Empl...
3  2012-1413  [Financial Services Industries, Payroll & Empl...
4  2012-1370   [Energy Taxation, Financial Services Industries]
5  2009-1786                    [Financial Services Industries]
6  2012-1170       [Financial Services Industries, Real Estate]
7  2012-1101       [Financial Services Industries, Real Estate]
8  2009-1386                    [Financial Services Industries]
9  2012-0754  [Expatriate Taxation, Financial Services Indus...

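Now you have an awkward dtu_topic_split column whose values are Python lists. That is a bit tricky to work with.

To select the rows for one item you are interested in, you can apply a lambda function. For example: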

print df.dtu_topic_split.apply(lambda x: 'Energy Taxation' in x)

That will give you a boolean Series.

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: dtu_topic_split, dtype: bool

You can then pass that to df[...] using subscript notation:

energy = df[df.dtu_topic_split.apply(lambda x: 'Energy Taxation' in x)]

print energy
   dtu_docid                                   dtu_topic_split
4  2012-1370  [Energy Taxation, Financial Services Industries]
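
The membership test above also suggests why the line in the question's latest update fails under plain Python. `targetcat in x` already evaluates to a single bool, and the built-in `any` expects an iterable, so wrapping the bool in `any(...)` raises `TypeError: 'bool' object is not iterable`. Why the same line ran in the notebook is not stated in the post; one possible explanation, offered only as an assumption, is that the notebook session had NumPy's `any` in scope (for example through pylab mode), which accepts a scalar. A minimal sketch in plain Python, using a made-up cell value:

```python
# Hypothetical single cell value, shaped like one entry of dtu_topic_split.
topics = ['Financial Services Industries', 'Real Estate']
targetcat = 'Financial Services Industries'

# `in` on a list already returns a plain bool.
is_member = targetcat in topics

# Wrapping that bool in the built-in any() fails, because any() needs an iterable.
try:
    any(is_member)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Dropping any() gives a predicate that is safe to use inside map/apply.
def has_topic(cell, wanted=targetcat):
    return wanted in cell

print(is_member)
print(error_message)
print(has_topic(['Energy Taxation']))
```

Under that reading, removing `any(...)` from the lambda and keeping only `targetcat in x` should behave the same in both environments.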

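Another, possibly more efficient, way is to convert your data to long format.

Going back to the cleaned variable (a list of lists), you can write a small function that "stacks" the rows that have multiple topics.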

def make_long(cleaned):
    lng = []
    for row in cleaned:
        # row is a list of length 2
        topics = row[1] # second item is the list of topics
        dtu_docid = row[0]
        for topic in topics:
            lng.append([dtu_docid, topic])

    return lng

In this case, cleaned is 10 rows long. When you call make_long, you end up with 17 rows, because any row with more than one topic appears multiple times.

make_long(cleaned)
Out[208]: 
[['2010-0185', 'Financial Services Industries'],
 ['2010-0152', 'Financial Services Industries'],
 ['2010-0152', 'International'],
 ['2012-1421', 'Financial Services Industries'],
 ['2012-1421', 'Payroll & Employment Tax'],
 ['2012-1413', 'Financial Services Industries'],
 ['2012-1413', 'Payroll & Employment Tax'],
 ['2012-1370', 'Energy Taxation'],
 ['2012-1370', 'Financial Services Industries'],
 ['2009-1786', 'Financial Services Industries'],
 ['2012-1170', 'Financial Services Industries'],
 ['2012-1170', 'Real Estate'],
 ['2012-1101', 'Financial Services Industries'],
 ['2012-1101', 'Real Estate'],
 ['2009-1386', 'Financial Services Industries'],
 ['2012-0754', 'Expatriate Taxation'],
 ['2012-0754', 'Financial Services Industries']]

Then you can stick that into a DataFrame, and you should be in business.

lng = pd.DataFrame(make_long(cleaned),
    columns=['dtu_docid', 'dtu_topic_split'])

print lng
    dtu_docid                dtu_topic_split
0   2010-0185  Financial Services Industries
1   2010-0152  Financial Services Industries
2   2010-0152                  International
3   2012-1421  Financial Services Industries
4   2012-1421       Payroll & Employment Tax
5   2012-1413  Financial Services Industries
6   2012-1413       Payroll & Employment Tax
7   2012-1370                Energy Taxation
8   2012-1370  Financial Services Industries
9   2009-1786  Financial Services Industries
10  2012-1170  Financial Services Industries
11  2012-1170                    Real Estate
12  2012-1101  Financial Services Industries
13  2012-1101                    Real Estate
14  2009-1386  Financial Services Industries
15  2012-0754            Expatriate Taxation
16  2012-0754  Financial Services Industries
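
Newer pandas releases can do the same reshaping without a hand-written loop. The sketch below assumes pandas 0.25 or later, where `DataFrame.explode` is available (it is not in the 0.11 release shown in the question's traceback). It rebuilds a three-row stand-in for the wide frame so that it runs on its own; the names `wide` and `long_form` are made up for illustration:

```python
import pandas as pd

# Small stand-in for the wide frame: one list of topics per document.
wide = pd.DataFrame({
    'dtu_docid': ['2010-0185', '2010-0152', '2012-1370'],
    'dtu_topic_split': [
        ['Financial Services Industries'],
        ['Financial Services Industries', 'International'],
        ['Energy Taxation', 'Financial Services Industries'],
    ],
})

# explode() emits one row per list element and repeats the other columns;
# reset_index(drop=True) replaces the repeated original index with 0..n-1.
long_form = wide.explode('dtu_topic_split').reset_index(drop=True)

print(long_form)
```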

That way, you can use the isin method on the pd.Series object to select rows by one or more topics at once.

selected = ['Financial Services Industries', 'Real Estate']
print lng[lng.dtu_topic_split.isin(selected)]

    dtu_docid                dtu_topic_split
0   2010-0185  Financial Services Industries
1   2010-0152  Financial Services Industries
3   2012-1421  Financial Services Industries
5   2012-1413  Financial Services Industries
8   2012-1370  Financial Services Industries
9   2009-1786  Financial Services Industries
10  2012-1170  Financial Services Industries
11  2012-1170                    Real Estate
12  2012-1101  Financial Services Industries
13  2012-1101                    Real Estate
14  2009-1386  Financial Services Industries
16  2012-0754  Financial Services Industries
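
The question mentions building a separate subset for each of 30-50 categories. With the long format, one way to sketch that is a single `groupby` pass instead of one filter per category. The frame below is a small stand-in for `lng`, and the names `lng_demo`, `docs_by_topic` and `matching_ids` are made up for illustration:

```python
import pandas as pd

# Small stand-in for the long-format frame: one (document, topic) pair per row.
lng_demo = pd.DataFrame(
    [['2010-0185', 'Financial Services Industries'],
     ['2012-1170', 'Financial Services Industries'],
     ['2012-1170', 'Real Estate'],
     ['2012-1370', 'Energy Taxation'],
     ['2012-1370', 'Financial Services Industries']],
    columns=['dtu_docid', 'dtu_topic_split'])

# Map each topic to the list of document ids tagged with it.
docs_by_topic = {
    topic: group['dtu_docid'].tolist()
    for topic, group in lng_demo.groupby('dtu_topic_split')
}

# isin() still covers the "several topics at once" case; unique() drops the
# duplicates that arise when one document matches more than one selected topic.
selected = ['Financial Services Industries', 'Real Estate']
mask = lng_demo['dtu_topic_split'].isin(selected)
matching_ids = lng_demo.loc[mask, 'dtu_docid'].unique()

print(docs_by_topic['Real Estate'])
print(list(matching_ids))
```

Whether this beats a plain Python loop on roughly 100K records is not something the thread measures, so it is worth timing both on the real data.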

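Hope some of that is useful!

Answer 1 (score: 0)

This may not be the exact cause of your problem, but one thing that struck me is that you are comparing two lists for exact equality, when (if I understand correctly) you want to test for the presence of targetcat in dtu_topic_split, which I assume is a list of topics.

Assuming that is the case: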

targetcat = ['International Taxation']

criterion = foo['dtu_topic_split'].map(lambda possiblecat: \
    any([t in p for t in targetcat for p in possiblecat]))

I have not tested this, but I think it returns True if any category in targetcat occurs as a substring of any category in possiblecat.
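
The difference between the three tests discussed here (exact list equality, substring matching, and exact membership) can be seen on plain lists. A minimal sketch; the helper names and the sample values are made up for illustration:

```python
def equals_list(possiblecat, targetcat):
    # True only when the whole list of topics is identical to targetcat.
    return possiblecat == targetcat

def matches_substring(possiblecat, targetcat):
    # True when any target string occurs inside any topic string.
    return any(t in p for t in targetcat for p in possiblecat)

def matches_exact(possiblecat, targetcat):
    # True when any target string is itself one of the topics.
    return any(t in possiblecat for t in targetcat)

row = ['International Taxation', 'Asia']

print(equals_list(row, ['International Taxation']))        # False
print(matches_substring(row, ['International Taxation']))  # True
print(matches_exact(row, ['International Taxation']))      # True
print(matches_substring(row, ['Tax']))                     # True
print(matches_exact(row, ['Tax']))                         # False
```

The substring version also matches partial names such as 'Tax', which may or may not be wanted. With a Series of lists, either predicate could be passed to map, for example `foo['dtu_topic_split'].map(lambda p: matches_exact(p, targetcat))`.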