I have two dataframes. The city_state dataframe:
   city        state
0  huntsville  alabama
1  montgomery  alabama
2  birmingham  alabama
3  mobile      alabama
4  dothan      alabama
5  chicago     illinois
6  boise       idaho
7  des moines  iowa
And the sentence dataframe:
   sentence
0  marthy was born in dothan
1  michelle reads some books at her home
2  hasan is highschool student in chicago
3  hartford of the west is the nickname of des moines
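I want to derive a new feature called city in the sentence dataframe. If a sentence contains the name of a city from city_state['city'], the city column should hold that name; if the sentence contains none of the cities in that column, the value should be Null.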
The expected new dataframe looks like this:
   sentence                                     city
0  marthy was born in dothan                    dothan
1  michelle reads some books at her home        Null
2  hasan is highschool student in chicago       chicago
3  capital of dream is the motto of des moines  des moines
I ran the code below, but it fails with the error quoted in the note at the end of this question:
sentence['city'] ={}
for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
        else:
            sentence['city'].append(None)
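If you have feature-engineering experience with a similar case, could you give me some advice on writing correct code that produces the expected result?
Thanks.
Note: this is the full error log:
ValueError: Length of values does not match length of index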
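Answer 0 (score: 1)
A somewhat quick and dirty apply; it has not been tested on large dataframes, so use it with caution. First define a function that extracts the city names: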
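def ex_city(col, cities):
    # Collect every city name that occurs as a substring of the sentence.
    output = []
    for w in cities:
        if w in col:
            output.append(w)
    # Join multiple hits with commas; return None when nothing matched.
    return ','.join(output) if output else None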
Then apply it to the sentence dataframe:
city_list = city_state.city.unique().tolist()
sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
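For anyone who wants to try this end to end, here is a self-contained sketch of the same approach using the sample data from the question. The dataframe construction is my own assumption about how the data is loaded, and the list comprehension is just a compact rewrite of the loop above. Because the check is a substring test, multi-word names such as "des moines" are found, but a name like "mobile" would also match inside "automobile", so treat it as a starting point rather than a finished solution.

```python
import pandas as pd

# Sample data copied from the question (construction is assumed).
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})


def ex_city(col, cities):
    # Substring test: keeps multi-word names, but may over-match.
    output = [w for w in cities if w in col]
    return ",".join(output) if output else None


city_list = city_state["city"].unique().tolist()
sentence["city"] = sentence["sentence"].apply(lambda x: ex_city(x, city_list))
result = sentence["city"].tolist()
print(sentence)
```

The city column ends up with one entry per sentence, which is what avoids the length mismatch from the question.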
Answer 1 (score: 0)
Something like this might work. I would try it myself, but I am on my phone.
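sentence_cities = []
# Use a set of values: "word in <Series>" would test the index, not the city names.
cities = set(city_state.city)
for text in sentence.sentence:
    # Append exactly one entry per sentence: the first word that is a city, else None.
    # Splitting on whitespace cannot match multi-word names such as "des moines".
    match = next((word for word in text.split() if word in cities), None)
    sentence_cities.append(match)
sentence['city'] = sentence_cities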
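Answer 2 (score: 0)
Let sdf = the sentence dataframe and cdf = the city_state dataframe.
"des moines" causes a problem when you use str.split, because the name contains a space.
Handle that city separately, first or last (this needs testing; note that the plain assignment to sdf['city'] further down overwrites the whole column, so this line only sticks if it runs after that assignment):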
sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'
Then the rest:
def get_city(sentence, cities):
    # Return the first space-separated word that is a known city.
    for word in sentence.split(' '):
        if word in cities:
            return word
    return None

cities = cdf['city'].tolist()
sdf['city'] = sdf['sentence'].apply(lambda x: get_city(x, cities))
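If special-casing "des moines" feels fragile, one possible alternative (my own sketch, not what this answer proposes) is to avoid splitting altogether and build a single regular expression from the city names, wrapped in word boundaries, then pass it to Series.str.extract. The sample data is taken from the question; the names pattern and ordered are just illustrative. Sorting the names longest first is a precaution so that a longer name wins over any shorter name it happens to contain.

```python
import re

import pandas as pd

# City names and sentences copied from the question.
cities = ["huntsville", "montgomery", "birmingham", "mobile",
          "dothan", "chicago", "boise", "des moines"]
sdf = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})

# Longest names first, each escaped so it is matched literally.
ordered = sorted(cities, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(c) for c in ordered) + r")\b"

# One capture group with expand=False gives a Series; no match becomes NaN.
sdf["city"] = sdf["sentence"].str.extract(pattern, expand=False)
print(sdf)
```

This returns only the first city found in each sentence, and the word boundaries keep "mobile" from matching inside a word like "automobile". Sentences without a city get NaN rather than None.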