I have code that goes through each row/item in a series and converts it into bigrams/trigrams. The code is as follows:
def splitting(txt,gram=2):
    tx1 = txt.str.replace(r'[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    if gram==2:
        return map(tuple,set(map(frozenset,list(nltk.bigrams(txlis)))))
    else:
        return map(tuple,set(map(frozenset,list(nltk.trigrams(txlis)))))

#pdb.set_trace()
print len(namedat)
prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))
The error occurs on the last line when I apply it to the series data named namedat, which looks like this:
0 inter-burgo ansan
1 dogo glory condo
2 w hotel
3 onyang grand hotel
4 onyang hot spring hotel
5 onyang cheil hotel (ex. onyang palace hotel)
6 springhill suites paso robles atascadero
7 best western plus colony inn
8 hesse
9 ibis styles aachen city
10 pullman aachen quellenhof
11 mercure aachen europaplatz
12 leonardo hotel aachen
13 aquis grana cityhotel
14 buschhausen
... ...
[166295 rows x 1 columns]
ValueError: could not broadcast input array from shape (2) into shape (1) when using df.apply
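I tried debugging, and both the texts and the bigrams are generated successfully, so there seems to be nothing wrong with the splitting function itself. I don't know how to solve this. Please help.
Full error message: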
Traceback (most recent call last):
  File "data_playground.py", line 163, in <module>
    main()
  File "data_playground.py", line 156, in main
    createparams(db.hotelbeds_properties,"hotelbeds")
  File "data_playground.py", line 139, in createparams
    prop_params = analyze(prop_subdf)
  File "data_playground.py", line 110, in analyze
    prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4877, in apply
    ignore_failures=ignore_failures)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4990, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 461, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 6173, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4642, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4604, in construction_error
    raise e
ValueError: could not broadcast input array from shape (2) into shape (1)
An example of what my code does: it takes one row from the table above, for example:
name shaba boutique hotel
Name: 166278, dtype: object
and returns the bigrams built from it:
[(u'shaba', u'boutique'), (u'boutique', u'hotel')]
If I use a simple for loop (with iterrows), the function works and I get a list. I don't understand why the apply call fails.
Answer 0 (score: 0)
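This error occurs because df.apply(axis=1) needs each call to return a single value in order to build a Series; the pandas documentation for DataFrame.apply has more on this. Your code returns the result of map(tuple, ...), a list whose length is greater than 1 for any row with more than two words. You can see how this works by trying it on a small fake data frame, like the one below: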
namedat_s = pd.Series(['inter-burgo ansan', 'glory condo', 'w hotel'])
namedat = pd.DataFrame(namedat_s)
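...but put 'dogo' back in and you will get the error again. This is a good example of why a single long line of code is not always helpful, especially when you are just starting out.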
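To make the varying lengths concrete, here is a minimal sketch in plain Python. The helper name bigrams_per_row is hypothetical, zip(words, words[1:]) is only a stand-in for nltk.bigrams, and stop-word filtering is left out because stop_wrds is not shown in the question.

```python
import re

def bigrams_per_row(name):
    # Strip punctuation, split into words, and pair each word with its neighbour.
    # zip(words, words[1:]) stands in for nltk.bigrams (an assumption of this sketch).
    words = re.sub(r'[^\w\s]', '', name).split()
    return list(zip(words, words[1:]))

names = ['inter-burgo ansan', 'dogo glory condo', 'w hotel']
for name in names:
    result = bigrams_per_row(name)
    # Print how many bigrams each row produces, followed by the bigrams.
    print(len(result), result)
```

The two-word names each yield one bigram, while the three-word name yields two, so the returned lists do not all have the same length.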
If you had tried this, you might have found the answer sooner:
# Note: find_ngrams is not defined in this post; it is presumably a helper
# equivalent to nltk.bigrams / nltk.trigrams.
def splitting(txt,gram=2):
    tx1 = txt.str.replace(r'[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    # Print every intermediate step of the one-liner separately
    print 1, txlis
    print 2, find_ngrams(txlis,2)
    print 3, list(find_ngrams(txlis,2))
    print 4, map(frozenset,list(find_ngrams(txlis,2)))
    print 5, set(map(frozenset,list(find_ngrams(txlis,2))))
    print 6, map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    # Length of the value that is about to be returned
    print len(map(tuple,set(map(frozenset,list(find_ngrams(txlis,2))))))
    if gram==2:
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    else:
        # trigram branch (the posted version passed 2 here, likely a copy-paste slip)
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,3)))))
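You will see that the error occurs, as you said, not inside the splitting function but in what happens after it returns, and knowing what is returned gives you a big clue about the cause.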
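One possible way around the problem, sketched here under some assumptions, is to apply the function to the single text column rather than to the rows of the data frame. The column name name is taken from the example row in the question. The helper name_bigrams is hypothetical, zip(words, words[1:]) is a stand-in for nltk.bigrams, and stop-word filtering is omitted because stop_wrds is not shown. This is not the original code, just an illustration of the idea.

```python
import re
import pandas as pd

def name_bigrams(name):
    # Takes one string (not a row) and returns a plain list of tuples.
    # zip(...) stands in for nltk.bigrams (an assumption of this sketch).
    words = re.sub(r'[^\w\s]', '', name).split()
    return list(zip(words, words[1:]))

namedat = pd.DataFrame({'name': ['inter-burgo ansan', 'dogo glory condo', 'w hotel']})

# Series.apply calls the function once per element; each returned list is
# stored as a single object, so lists of different lengths fit in one column.
prop_data = namedat['name'].apply(name_bigrams).to_frame('bigrams')
print(prop_data)
```

Because each cell holds a whole list, pandas does not try to spread the returned values across columns, which is where the row-wise apply ran into the shape mismatch.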