Dataset:
from nltk import ngrams, word_tokenize
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]
Id bigram
1952043 [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top),
1918916 [(Luxury,Apartments),(Apartments,consisting),(consisting,11),
1645751 [(Flat,available),(available,sale),(sale,Medavakkam),
1270503 [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks),
1495638 [(near,medavakkam),(medavakkam,junction),(junction,calm),
I have a Python file (Categories.py) that contains an unsupervised classification of property/land features.
category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
            ('Swimming Pool', 'IN', 'Recreation_Ammenities'),
            ('Toddler Pool', 'IN', 'Recreation_Ammenities'),
            ('Jogging Tracks', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']
To find the words that match between the bigram column and the category list:
tokens=pd.Series(df["bigram"])
Lid=pd.Series(df["Id"])
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation])))
When I run the code above, I get this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
I need help with this.
My desired output is:
Id bigram Recreation_Amenities
1952043 [(Swimming,Pool),(Pool,in),(in,the),.. Swimming Pool
1918916 [(Luxury,Apartments),(Apartments,.. Luxury Apartments
1645751 [(Flat,available),(available,sale)..
1270503 [(Toddler,Pool),(Jogging,Tracks).. Toddler Pool,Jogging Tracks
1495638 [(near,medavakkam),..
Answer 0 (score: 1)
Something along these lines should work for you:
Each bigram is joined with a single space so that you can test whether it appears in your category list.
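The idea above can be sketched in plain Python. This is a minimal sketch, assuming each row of the bigram column is a list of 2-tuples of strings and that category phrases must match exactly (case-sensitive). The helper name `match_categories` is hypothetical and does not come from the original answer.

```python
def match_categories(bigrams, categories):
    """Return the category phrases that occur among the space-joined bigrams."""
    # Join each (word1, word2) tuple into "word1 word2" for exact comparison.
    joined = {" ".join(pair) for pair in bigrams}
    # Iterate over the categories so the result keeps the category order.
    return [phrase for phrase in categories if phrase in joined]


recreation = ["Luxury Apartments", "Swimming Pool", "Toddler Pool", "Jogging Tracks"]
row = [("Toddler", "Pool"), ("Pool", "with"), ("with", "Jogging"), ("Jogging", "Tracks")]
print(match_categories(row, recreation))  # ['Toddler Pool', 'Jogging Tracks']
```

On the dataframe from the question, something like `df['bigram'].apply(lambda x: match_categories(x, Categories.Recreation))` would probably do the same per row.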
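Answer 1 (score: 1)
You can join each tuple with a space, then use a nested list comprehension inside apply to pick out the Recreation entries that occur among the bigrams, i.e.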
df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]])
Say you have a dataframe
   Id       bigram
0  1270503  [(Toddler, Pool), (Pool, with), (with, Jogging), (Jogging, Tracks)]
1  1952043  [(Swimming, Pool), (Pool, in), (in, the), (the, roof), (roof, top)]
2  1918916  [(Luxury, Apartments), (Apartments, consisting), (consisting, 11)]
3  1495638  [(near, medavakkam), (medavakkam, junction), (junction, calm)]
4  1645751  [(Flat, available), (available, sale), (sale, Medavakkam)]
and you have the list Recreation, i.e.
Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']
Then
df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]])
Output of df['Recreation_Amenities']:
0    [Toddler Pool, Jogging Tracks]
1    [Swimming Pool]
2    [Luxury Apartments]
3    []
4    []
Name: Recreation_Amenities, dtype: object
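The desired output in the question shows comma-separated strings rather than lists. One way to get there, sketched in plain Python with a hypothetical helper `to_label`, is to join each list of matches. An empty list then becomes an empty string.

```python
def to_label(matches, sep=","):
    """Collapse a list of matched phrases into one delimited string."""
    return sep.join(matches)


# Sample rows in the same shape as the Recreation_Amenities column.
results = [["Toddler Pool", "Jogging Tracks"], ["Swimming Pool"], []]
labels = [to_label(m) for m in results]
print(labels)  # ['Toddler Pool,Jogging Tracks', 'Swimming Pool', '']
```

With pandas, `df['Recreation_Amenities'].str.join(',')` should give the same result for a column of lists.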