Question

I have code that runs through every row/item in a series and converts it to bigrams/trigrams. The code is as follows:

def splitting(txt,gram=2):
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    if gram==2:
        return map(tuple,set(map(frozenset,list(nltk.bigrams(txlis)))))
    else:
        return map(tuple,set(map(frozenset,list(nltk.trigrams(txlis)))))

#pdb.set_trace()
print len(namedat)
prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))

The error occurs on the last line, when I apply the function to the data named namedat, which looks like this:

0                                       inter-burgo ansan
1                                        dogo glory condo
2                                                 w hotel
3                                      onyang grand hotel
4                                 onyang hot spring hotel
5            onyang cheil hotel (ex. onyang palace hotel)
6                springhill suites paso robles atascadero
7                            best western plus colony inn
8                                                  hesse 
9                                 ibis styles aachen city
10                              pullman aachen quellenhof
11                             mercure aachen europaplatz
12                                  leonardo hotel aachen
13                                  aquis grana cityhotel
14                                            buschhausen
...                                                   ...
[166295 rows x 1 columns]

ValueError: could not broadcast input array from shape (2) into shape (1) when using df.apply

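I tried debugging: the word lists and the bigrams are both generated successfully, so the splitting function itself seems to be fine. I don't know how to solve this. Please help.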

Full error message:

Traceback (most recent call last):
  File "data_playground.py", line 163, in <module>
    main()
  File "data_playground.py", line 156, in main
    createparams(db.hotelbeds_properties,"hotelbeds")
  File "data_playground.py", line 139, in createparams
    prop_params = analyze(prop_subdf)
  File "data_playground.py", line 110, in analyze
    prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4877, in apply
    ignore_failures=ignore_failures)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4990, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 461, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 6173, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4642, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4604, in construction_error
    raise e
ValueError: could not broadcast input array from shape (2) into shape (1)

An example of what my code does: it takes one row from the table above, for example:

name    shaba boutique hotel
Name: 166278, dtype: object

and returns the bigrams made from it:

[(u'shaba', u'boutique'), (u'boutique', u'hotel')]

If I use a simple for loop (with iterrows), the function works and I get a list. I don't understand why the apply function fails.

Answer 1

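The reason for this error is that df.apply(axis=1) expects the function to return a single value per row so that it can build a Series. Your code returns the result of map(tuple, ...), which has length > 1 for any row with more than two words. You can try this on a small fake data frame to see how it works, as follows: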

namedat_s = pd.Series(['inter-burgo ansan', 'glory condo', 'w hotel'])
namedat = pd.DataFrame(namedat_s)

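...but put 'dogo' back in and you will get the error again. This is a good example of why a single long line of code is not always helpful, especially if you are just starting out.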
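To make the length mismatch concrete, here is a rough pure-Python sketch of what each row produces. It is not the asker's function: the helper name `bigrams` is hypothetical, stop-word filtering and the frozenset de-duplication are left out, and `zip` stands in for `nltk.bigrams`.

```python
import re

def bigrams(name):
    """Strip punctuation, split on whitespace, and pair up adjacent words."""
    words = re.sub(r"[^\w\s]", "", name).split()
    return list(zip(words, words[1:]))

for name in ["inter-burgo ansan", "dogo glory condo", "w hotel"]:
    result = bigrams(name)
    # Two-word names give one bigram, three-word names give two.
    print(len(result), result)
```

The first and third names yield a list of length 1 and the second a list of length 2, which appears to be the shape (1) versus shape (2) mismatch the error message is complaining about.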

If you had tried this, you might have found the answer sooner:

def splitting(txt,gram=2):
    # find_ngrams is not defined in this answer; presumably it is a helper
    # that yields n-grams the way nltk.bigrams / nltk.trigrams do
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    # print every intermediate step to see what is actually being returned
    print 1, txlis
    print 2, find_ngrams(txlis,2)
    print 3, list(find_ngrams(txlis,2))
    print 4, map(frozenset,list(find_ngrams(txlis,2)))
    print 5, set(map(frozenset,list(find_ngrams(txlis,2))))
    print 6, map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    print len(map(tuple,set(map(frozenset,list(find_ngrams(txlis,2))))))
    if gram==2:
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    else:
        # trigram branch: n=3 (the original repeated 2 here)
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,3)))))

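You will see that the error occurs, as you said, not inside the splitting function but in what happens after it returns, and knowing what is being returned gives you a big clue as to why.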
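One possible way to act on that clue, offered only as a sketch and not as the fix the answer itself proposes, is to apply the function to the column (a Series) instead of to DataFrame rows, so that each list of bigrams is stored as one cell. The code below is written for Python 3 and assumes a column called `name`. `STOP_WORDS` is a hypothetical stand-in for `stop_wrds`, and `name_bigrams` is a hypothetical simplification of `splitting` that uses `zip` in place of `nltk.bigrams`, skips the frozenset de-duplication, and returns an empty list rather than `np.nan` for empty names.

```python
import re
import pandas as pd

# Hypothetical stand-in for the asker's stop_wrds collection.
STOP_WORDS = {"the", "a", "an", "of"}

def name_bigrams(name):
    """Return all bigrams of one name as a single list object."""
    words = re.sub(r"[^\w\s]", "", name).split()
    words = [w for w in words if w.lower() not in STOP_WORDS]
    return list(zip(words, words[1:]))

namedat = pd.DataFrame(
    {"name": ["inter-burgo ansan", "dogo glory condo", "w hotel"]}
)

# Series.apply calls the function once per value and keeps each returned
# list as one cell, so lists of different lengths never need to line up.
prop_data = namedat["name"].apply(name_bigrams).to_frame("bigrams")
print(prop_data)
```

This mirrors what the iterrows loop does: one list per name, collected without pandas trying to spread the list items across columns.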
