Question

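I am getting unwanted 'None' values filled in when I merge a list of lists into a single-column DataFrame. I have already run the raw data through NLTK tokenization.

My code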

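import pandas as pd
from gensim.models.phrases import Phrases, Phraser
from nltk.tokenize import word_tokenize

# Tokenize one address string into a list of words
def apwords(words):
    filtered_sentence = []
    words = word_tokenize(words)
    for w in words:
        filtered_sentence.append(w)
    return filtered_sentence
addwords = lambda x: apwords(x)
# data is the existing DataFrame holding the cleaned address strings
clean = data['Clean_addr'].apply(addwords)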


clean =list(clean)
bigram = Phrases(clean, min_count=150, threshold=2)
bigrams = Phraser(bigram)

x=[]
for i in clean:
    x.append(bigrams[i])
y=pd.DataFrame(x)
data['Phrases_Clean_Addr']=y.apply(lambda x: ' '.join(x.astype(str)), axis=1)

Cleaned data output

[['robeco', 'des','voeux', 'rd','central','f','man','yee','building','room','central'],
 ['nikko','asset','management','hk','limi','f','man','yee','building','des','voeux','rd','central'],
 ['cfa','institute','office','f','man','yee','building','des','voeux','rd','central'],
 ['victon','registrations','ltd','room','f','regent','centre','queens','rd','central','central'],
 ['ding','fung','ltd','room','crawford','house','queens','rd','central','central'],
 ['quam','ltd','queens','rd','central','th','th','floors','china','building'],
 ['f', 'des', 'voeux', 'rd', 'central'],
 ['f', 'wincome', 'centre', 'des', 'voeux', 'rd', 'central'],
 ['ags', 'f', 'chuangs', 'tower', 'connaught', 'rd', 'central']]

My current output

robeco des_voeux rd central f man yee building room central None None None None None None None None None None
nikko asset management hk limi f man yee building des_voeux rd central None None None None None None None None
cfa institute office f man yee building des_voeux rd central None None None None None None None None None None
victon registrations ltd room f regent centre queens_rd central central None None None None None None None None None None
ding fung ltd room crawford house queens_rd central central None None None None None None None None None None None
quam ltd queens_rd central th th floors china building None None None None None None None None None None None
canara bank aon china bldng queens_rd centeal central None None None None None None None None None None None None
gia room f aon china building queens_rd central None None None None None None None None None None None None
zaaba capital ltd_unit b f china building queens_rd central None None None None None None None None None None None
firestar diamond hk nd_floor new henry house ice house rd None None None None None None None None None None

Expected output

None of the 'None' values should be appended to the DataFrame column.

robeco des_voeux rd central f man yee building room central 
nikko asset management hk limi f man yee building des_voeux rd central 
cfa institute office f man yee building des_voeux rd central 
victon registrations ltd room f regent centre queens_rd central central 
ding fung ltd room crawford house queens_rd central central 
quam ltd queens_rd central th th floors china building

Answer 1

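This is expected behavior, because you created a DataFrame from a list of lists of unequal length. In your example, the longest list in x has 13 entries, so your DataFrame y contains 13 columns. Any row with fewer than 13 entries is padded with NA values, and astype(str) then turns each of those missing cells into the literal text 'None' before the join.

To get the output you asked for, simply add dropna to the function you pass to apply.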
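The padding is easy to confirm on a small case. The following is a minimal sketch, assuming two shortened token lists taken from the question's data; the variable names are illustrative only.

```python
import pandas as pd

# Two ragged token lists, shortened from the question's data (5 and 7 tokens)
tokens = [
    ['f', 'des', 'voeux', 'rd', 'central'],
    ['ags', 'f', 'chuangs', 'tower', 'connaught', 'rd', 'central'],
]

frame = pd.DataFrame(tokens)

# The frame gets one column per position in the longest list
n_columns = frame.shape[1]

# The shorter row is padded with missing values at the end
n_missing_first_row = int(frame.iloc[0].isna().sum())
n_missing_second_row = int(frame.iloc[1].isna().sum())

print(n_columns, n_missing_first_row, n_missing_second_row)
```

The first row is two tokens shorter than the second, so it ends up with two missing cells, and those are the cells that show up as 'None' once the row is converted to strings and joined.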

data['Phrases_Clean_Addr']=y.apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

So the complete solution is:

x = [['robeco', 'des','voeux', 'rd','central','f','man','yee','building','room','central'],['nikko','asset','management','hk','limi','f','man','yee','building','des','voeux','rd','central'],['cfa','institute','office','f','man','yee','building','des','voeux','rd','central'],['victon','registrations','ltd','room','f','regent','centre','queens','rd','central','central'],['ding','fung','ltd','room','crawford','house','queens','rd','central','central'],['quam','ltd','queens','rd','central','th','th','floors','china','building'],['f', 'des', 'voeux', 'rd', 'central'],['f', 'wincome', 'centre', 'des', 'voeux', 'rd', 'central'],['ags', 'f', 'chuangs', 'tower', 'connaught', 'rd', 'central']]

y = pd.DataFrame(x)

z = y.apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

>>> z.values
   array(['robeco des voeux rd central f man yee building room central',
   'nikko asset management hk limi f man yee building des voeux rd central',
   'cfa institute office f man yee building des voeux rd central',
   'victon registrations ltd room f regent centre queens rd central central',
   'ding fung ltd room crawford house queens rd central central',
   'quam ltd queens rd central th th floors china building',
   'f des voeux rd central', 'f wincome centre des voeux rd central',
   'ags f chuangs tower connaught rd central'], dtype=object)
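A different route, not the one used in the answer above, is to skip the intermediate DataFrame and join each token list directly, so no padding is ever created. This is only a sketch: the `phrased` list is a hypothetical stand-in for the per-row output of `bigrams[i]` in the question, and it assumes every element is already a string.

```python
import pandas as pd

# Hypothetical stand-in for the phrased token lists (one list per address)
phrased = [
    ['robeco', 'des_voeux', 'rd', 'central'],
    ['quam', 'ltd', 'queens_rd', 'central', 'th', 'th', 'floors'],
]

# Join each list on its own, so rows of different length never share columns
joined = [' '.join(tokens) for tokens in phrased]

# Wrap the result in a Series that can be assigned to a DataFrame column
result = pd.Series(joined, name='Phrases_Clean_Addr')

print(result.tolist())
```

If the original frame uses a non-default index, the Series would need the same index before assignment (for example by passing `index=data.index`), since pandas aligns on index labels.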
