How do I get the longest entries in a DataFrame?

Asked: 2016-08-16 15:07:02

Tags: python dask

I am trying to get the longest entries in a dask DataFrame. I call nlargest on a dask DataFrame that has two columns, like this:

import dask.dataframe as dd

df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())

The file opendns-random-domains.txt contains nothing but a long list of domain names. This is the output of the code above:

                  domain_name  domain_length
0                webmagnat.ro             12
1     nickelfreesolutions.com             23
2  scheepvaarttelefoongids.nl             26
3                  tursan.net             10
4       plannersanonymous.com             21

domain_name       object
domain_length    float64
dtype: object

Traceback (most recent call last):
  File "nlargest_test.py", line 9, in <module>
    print(top_3.head())
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
    result = result.compute()
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
    return compute(self, **kwargs)[0]
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
    **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
    raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object

Traceback
---------
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
    f = lambda df: df.nlargest(n, columns)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
    return self._nsorted(columns, n, 'nlargest', keep)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
    ser = getattr(self[columns[0]], method)(n, keep=keep)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
    return algos.select_n(self, n=n, keep=keep, method='nlargest')
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
    raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))

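I am confused: I called nlargest on a column that is reported as float64, yet I still get this error saying the method cannot be used with dtype object. The same approach works in pandas. How can I get the longest entries from the DataFrame?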

3 answers:

Answer 0 (score: 0)

I tried to reproduce your problem, but everything worked fine. May I suggest that you put together a Minimal Complete Verifiable Example?

Pandas example

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})

In [3]: df['y'] = df.x.map(len)

In [4]: df
Out[4]: 
      x  y
0     a  1
1    bb  2
2   ccc  3
3  dddd  4

In [5]: df.nlargest(3, 'y')
Out[5]: 
      x  y
3  dddd  4
2   ccc  3
1    bb  2

Dask DataFrame example

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})

In [3]: import dask.dataframe as dd

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf['y'] = ddf.x.map(len)

In [6]: ddf.nlargest(3, 'y').compute()
Out[6]: 
      x  y
3  dddd  4
2   ccc  3
1    bb  2

Or perhaps this only works on the git master version?

Answer 1 (score: 0)

An explicit type conversion helped me:

df['column'].astype(str).astype(float).nlargest(5)
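
As a rough illustration of that round trip, here is a minimal pandas sketch. The data is hypothetical: it assumes the lengths ended up stored as text in an object column, which is one way such a column can refuse nlargest. The values are borrowed from the question's output.

```python
import pandas as pd

# Hypothetical data: lengths stored as strings in an object-dtype column.
lengths = pd.Series(["12", "23", "26", "10", "21"], dtype=object, name="domain_length")

# astype(str) normalizes every value to text, astype(float) then parses it,
# so nlargest sees a numeric dtype.
top_3 = lengths.astype(str).astype(float).nlargest(3)
print(top_3)
```

If the column already holds plain numbers, the intermediate astype(str) step is probably unnecessary and a single astype(float) should be enough.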

Answer 2 (score: 0)

You just need to use .astype() to change the type of the relevant column to int or float.

For example, in your case:

top_3 = df['domain_length'].astype(float).nlargest(3)
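
To see why the cast matters, the sketch below reproduces the error in plain pandas and then applies the fix. It is only an approximation of the question's setup: the object dtype is forced by hand here (an assumption, since the original dask run reported float64 in its metadata), and the five domain names are the ones shown in the question's output.

```python
import pandas as pd

df = pd.DataFrame({"domain_name": [
    "webmagnat.ro",
    "nickelfreesolutions.com",
    "scheepvaarttelefoongids.nl",
    "tursan.net",
    "plannersanonymous.com",
]})

# Assumption: force object dtype to mimic a length column that is not numeric.
df["domain_length"] = pd.Series([len(name) for name in df["domain_name"]], dtype=object)

try:
    df["domain_length"].nlargest(3)
    raised = False
except TypeError as exc:
    raised = True
    print("nlargest failed:", exc)

# After casting to a numeric dtype, nlargest works and keeps the domain names.
df["domain_length"] = df["domain_length"].astype("int64")
top_3 = df.nlargest(3, "domain_length")
print(top_3)
```

Calling nlargest on the whole DataFrame, rather than on the single column, keeps domain_name next to each length, which is closer to what the question asks for.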