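======================== Update #2 ========================
What a beautiful day. I am slowly making progress. But while pandas is very fast and powerful, it has a steep learning curve and few good examples (at least for what I am trying to do).
The latest problem is with this specific line: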
catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]
It works in the IPython Notebook, but not on Ubuntu with Python 2.7.
Here is the error on Ubuntu:
Traceback (most recent call last):
  File "scikit2.py", line 27, in <module>
    catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/series.py", line 2408, in map
    mapped = map_f(values, arg)
  File "inference.pyx", line 861, in pandas.lib.map_infer (pandas/lib.c:41822)
  File "scikit2.py", line 27, in <lambda>
    catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]
TypeError: 'bool' object is not iterable
And here is the working code, with its output, from the IPython Notebook:
targetcat = 'Financial Services Industries'
#targetcat = 'Payroll & Employment Tax'
criterion = foo[foo['dtu_topic_split'].map(lambda x: any(targetcat in x))]
print criterion[['dtu_docid','dtu_topic_split']][:10]
dtu_docid dtu_topic_split
9 2010-0185 [Financial Services Industries]
17 2010-0152 [Financial Services Industries, International ...
46 2012-1421 [Financial Services Industries, Payroll & Empl...
49 2012-1413 [Financial Services Industries, Payroll & Empl...
66 2012-1370 [Energy Taxation, Financial Services Industrie...
94 2009-1786 [Financial Services Industries]
144 2012-1170 [Financial Services Industries, Real Estate]
163 2012-1101 [Financial Services Industries, Real Estate]
170 2009-1386 [Financial Services Industries]
249 2012-0754 [Expatriate Taxation, Financial Services Indus...
This is the Python version in the IPython Notebook:
print sys.version
2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3]
And on Ubuntu:
>>> import sys
>>> print sys.version
2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3]
>>>
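I need help. If I used conventional processing, I am confident I could finish this data setup and munging. I am still trying with pandas, but it is tough sledding, and the saddest part is that I am not even sure why the things that work, work. Errors like these add to the frustration.
======================== Update #1 ========================
Using the information in the first answer (thanks, tshauck), I found a way to solve this: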
targetcat = 'International Taxation'
criterion = foo[foo['dtu_topic_split'].map(lambda x: any(targetcat in x))]
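This produces the rows where targetcat appears in the dataframe.dtu_topic_split series. Since I am new to pandas, I am not sure whether this is the best way to handle it. I plan to build separate training modules for each of 30-50 categories. I am not sure whether I should iterate over the roughly 100K records in a more traditional Python style or use pandas techniques. Again, any alternatives or suggestions would be much appreciated.
I am new to pandas and working hard to learn how to take advantage of its power. Yesterday I posted a strategy that solved this by building a separate dataframe. After reading more, I am not sure it is the most efficient approach. I have tried several techniques to select specific rows from the dataframe based on the presence of a particular value in a series field of the dataframe. Below is a sample of the data and my attempts.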
print foo[['dtu_docid','dtu_topic_split']]
/home/davidwaldrop/Dropbox/Miscelaneous/E&Y M&C Project/scikit training
dtu_docid dtu_topic_split
0 2012-1553 [Energy Taxation, State & Local Taxation]
1 2012-1552 [Legislation & Policy, Financial Services]
2 2010-0227 [Quantitative Economics and Statistics]
3 2010-0215 [International Taxation, Asia]
4 2012-1529 [Ernst & Young Newsletters, This Week in Tax R...
This is what I am currently trying, to no avail:
targetcat = ['International Taxation']
criterion = foo['dtu_topic_split'].map(lambda x: x == targetcat)
print foo[criterion]
Empty DataFrame
Columns: [id, dtu_docid, dtu_topic, dtu_content, dtu_topic_split]
Index: []
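What I want is a dataframe containing the records that have 'International Taxation' in the list stored in the field dtu_topic_split. In the example above, that is the record at foo[3], whose dtu_topic_split value is [International Taxation, Asia].
As I mentioned, I really want to learn pandas and think it is very powerful. As a newcomer, it is hard not only to find a way to do what I want, but also to find a reasonably optimal way. My gut tells me this might best be done with an index, but I have not worked out that feature yet. Any insight is most welcome.
Answer 0 (score: 2)
Hopefully I understand your particular use case well enough to provide a decent answer.
Given some data: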
data = """
dtu_docid|dtu_topic_split
9|2010-0185|['Financial Services Industries']
17|2010-0152|['Financial Services Industries', 'International']
46|2012-1421|['Financial Services Industries', 'Payroll & Employment Tax']
49|2012-1413|['Financial Services Industries', 'Payroll & Employment Tax']
66|2012-1370|['Energy Taxation', 'Financial Services Industries']
94|2009-1786|['Financial Services Industries']
144|2012-1170|['Financial Services Industries', 'Real Estate']
163|2012-1101|['Financial Services Industries', 'Real Estate']
170|2009-1386|['Financial Services Industries']
249|2012-0754|['Expatriate Taxation', 'Financial Services Industries']
""".split('\n')
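And considering this part of the question:
"What I want is a dataframe containing the records that have 'International Taxation' stored in the field dtu_topic_split"
you can put the data into a DataFrame: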
import ast
import pandas as pd

rows = [row for row in data if len(row) > 0]
cleaned = []
for i, row in enumerate(rows):
    row = row.split('|')
    if i == 0:
        headers = row
    else:
        row = row[1:]  # get rid of the index
        # parse the list literal without executing arbitrary code
        row[1] = ast.literal_eval(row[1])
        cleaned.append(row)
df = pd.DataFrame(cleaned, columns=headers)
It looks like this:
print df
dtu_docid dtu_topic_split
0 2010-0185 [Financial Services Industries]
1 2010-0152 [Financial Services Industries, International]
2 2012-1421 [Financial Services Industries, Payroll & Empl...
3 2012-1413 [Financial Services Industries, Payroll & Empl...
4 2012-1370 [Energy Taxation, Financial Services Industries]
5 2009-1786 [Financial Services Industries]
6 2012-1170 [Financial Services Industries, Real Estate]
7 2012-1101 [Financial Services Industries, Real Estate]
8 2009-1386 [Financial Services Industries]
9 2012-0754 [Expatriate Taxation, Financial Services Indus...
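Now you have an awkward dtu_topic_split column whose values are Python lists. That is a little tricky to work with.
To select the rows containing one item you are interested in, you can apply a lambda function. For example: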
print df.dtu_topic_split.apply(lambda x: 'Energy Taxation' in x)
That gives you a boolean Series:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
Name: dtu_topic_split, dtype: bool
You can then pass that Series to df[...] using subscript notation.
energy = df[df.dtu_topic_split.apply(lambda x: 'Energy Taxation' in x)]
print energy
dtu_docid dtu_topic_split
4 2012-1370 [Energy Taxation, Financial Services Industries]
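Since the question mentions 30-50 categories, the same apply idea can be wrapped in a small helper that accepts several topics at once. This is only a sketch in Python 3 syntax: the function name rows_with_any_topic and the three sample rows are made up for illustration, and it assumes every cell in the column is a list of strings.

```python
import pandas as pd

# Illustrative data shaped like the example above (not the real dataset).
df = pd.DataFrame({
    'dtu_docid': ['2010-0185', '2010-0152', '2012-1370'],
    'dtu_topic_split': [
        ['Financial Services Industries'],
        ['Financial Services Industries', 'International'],
        ['Energy Taxation', 'Financial Services Industries'],
    ],
})


def rows_with_any_topic(frame, topics, column='dtu_topic_split'):
    """Return rows whose topic list shares at least one entry with `topics`.

    Hypothetical helper; assumes each cell in `column` is a list of strings.
    """
    wanted = set(topics)
    # isdisjoint() is True when nothing overlaps, so negate it for a match.
    mask = frame[column].apply(lambda cell: not wanted.isdisjoint(cell))
    return frame[mask]


energy_or_intl = rows_with_any_topic(df, ['Energy Taxation', 'International'])
print(energy_or_intl)
```

Passing a single-item list gives the same result as the 'Energy Taxation' in x lambda above.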
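Another, possibly more efficient, approach is to convert your data to long format.
Going back to the cleaned variable (a list of lists), you can write a small function that "stacks" the rows that have multiple topics.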
def make_long(cleaned):
    lng = []
    for row in cleaned:
        # row is a list of length 2
        topics = row[1]  # second item is the list of topics
        dtu_docid = row[0]
        for topic in topics:
            lng.append([dtu_docid, topic])
    return lng
In this case, cleaned is 10 rows long. When you call make_long, you end up with 17 rows, because any row with more than one topic appears multiple times.
make_long(cleaned)
Out[208]:
[['2010-0185', 'Financial Services Industries'],
['2010-0152', 'Financial Services Industries'],
['2010-0152', 'International'],
['2012-1421', 'Financial Services Industries'],
['2012-1421', 'Payroll & Employment Tax'],
['2012-1413', 'Financial Services Industries'],
['2012-1413', 'Payroll & Employment Tax'],
['2012-1370', 'Energy Taxation'],
['2012-1370', 'Financial Services Industries'],
['2009-1786', 'Financial Services Industries'],
['2012-1170', 'Financial Services Industries'],
['2012-1170', 'Real Estate'],
['2012-1101', 'Financial Services Industries'],
['2012-1101', 'Real Estate'],
['2009-1386', 'Financial Services Industries'],
['2012-0754', 'Expatriate Taxation'],
['2012-0754', 'Financial Services Industries']]
You can then put that into a DataFrame, and you should be in business.
lng = pd.DataFrame(make_long(cleaned),
columns=['dtu_docid', 'dtu_topic_split'])
print lng
dtu_docid dtu_topic_split
0 2010-0185 Financial Services Industries
1 2010-0152 Financial Services Industries
2 2010-0152 International
3 2012-1421 Financial Services Industries
4 2012-1421 Payroll & Employment Tax
5 2012-1413 Financial Services Industries
6 2012-1413 Payroll & Employment Tax
7 2012-1370 Energy Taxation
8 2012-1370 Financial Services Industries
9 2009-1786 Financial Services Industries
10 2012-1170 Financial Services Industries
11 2012-1170 Real Estate
12 2012-1101 Financial Services Industries
13 2012-1101 Real Estate
14 2009-1386 Financial Services Industries
15 2012-0754 Expatriate Taxation
16 2012-0754 Financial Services Industries
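If a newer pandas is available (0.25 or later, which is much newer than the 0.11 shown in the traceback), DataFrame.explode should give the same long format without the hand-written loop. A minimal sketch in Python 3 syntax, using a made-up three-row frame:

```python
import pandas as pd

# Illustrative data; each cell of dtu_topic_split is a Python list.
df = pd.DataFrame({
    'dtu_docid': ['2010-0185', '2010-0152', '2012-1370'],
    'dtu_topic_split': [
        ['Financial Services Industries'],
        ['Financial Services Industries', 'International'],
        ['Energy Taxation', 'Financial Services Industries'],
    ],
})

# explode() emits one row per list element and repeats the other columns;
# reset_index(drop=True) replaces the repeated index labels with 0..n-1.
lng = df.explode('dtu_topic_split').reset_index(drop=True)
print(lng)
```

On the old pandas version in the question, the make_long function above remains the way to go.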
This way, you can use the isin method on the pd.Series object to select rows by one or more topics at a time.
selected = ['Financial Services Industries', 'Real Estate']
print lng[lng.dtu_topic_split.isin(selected)]
dtu_docid dtu_topic_split
0 2010-0185 Financial Services Industries
1 2010-0152 Financial Services Industries
3 2012-1421 Financial Services Industries
5 2012-1413 Financial Services Industries
8 2012-1370 Financial Services Industries
9 2009-1786 Financial Services Industries
10 2012-1170 Financial Services Industries
11 2012-1170 Real Estate
12 2012-1101 Financial Services Industries
13 2012-1101 Real Estate
14 2009-1386 Financial Services Industries
16 2012-0754 Financial Services Industries
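The long format also makes it easy to collect the document IDs for every category in one pass, which may help with the one-model-per-category plan from the question. This is a sketch only, in Python 3 syntax with a small made-up frame; the name docs_by_topic is hypothetical.

```python
import pandas as pd

# Illustrative long-format data: one (document, topic) pair per row.
lng = pd.DataFrame(
    [
        ['2010-0185', 'Financial Services Industries'],
        ['2012-1170', 'Financial Services Industries'],
        ['2012-1170', 'Real Estate'],
        ['2012-1101', 'Financial Services Industries'],
        ['2012-1101', 'Real Estate'],
    ],
    columns=['dtu_docid', 'dtu_topic_split'],
)

# Group by topic and gather the document IDs that belong to each one.
docs_by_topic = lng.groupby('dtu_topic_split')['dtu_docid'].apply(list).to_dict()

for topic, doc_ids in sorted(docs_by_topic.items()):
    print(topic, doc_ids)
```

Each value in docs_by_topic can then be used to pull the matching documents for that category's training set.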
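Hope some of that helps!
Answer 1 (score: 0)
This may not be the exact cause of your problem, but one thing that struck me is that you are comparing two lists for exact equality, when (if I understand correctly) you want to test for the presence of targetcat in dtu_topic_split, which I am guessing is a list of topics.
Assuming that is the case: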
targetcat = ['International Taxation']
criterion = foo['dtu_topic_split'].map(lambda possiblecat: \
any([t in p for t in targetcat for p in possiblecat]))
I have not tested this, but I think it returns True if any category in targetcat is a substring of any category in possiblecat.
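A side note on the TypeError from Update #2, offered as a guess rather than something confirmed in this thread: targetcat in x already evaluates to a single bool, and the built-in any needs an iterable, so any(targetcat in x) fails in a plain Python script. If the notebook was started in pylab mode, any would instead be NumPy's version, which accepts a bare boolean, and that would explain why the same line works there. The sketch below uses Python 3 syntax and made-up sample values; it also shows how exact membership differs from the substring test above.

```python
# Illustrative values only.
targetcat = 'International Taxation'
topics = ['International Taxation', 'Asia']

# Membership on a list already yields one bool, so no any() is needed.
contains = targetcat in topics
print(contains)

# Wrapping that bool in the built-in any() raises the error from the traceback.
try:
    any(contains)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Exact membership: is some element equal to 'International'?
exact = 'International' in topics

# Substring test, as in this answer: is 'International' inside some element?
substring = any('International' in topic for topic in topics)
print(exact, substring)
```

So for exact category matches, lambda x: targetcat in x should be enough on both machines; the nested any(...) form is only needed when partial matches are wanted.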