Extract a new feature from a sentence column - Python

Time: 2019-12-10 03:47:20

Tags: python pandas dataframe machine-learning feature-extraction

I have two dataframes:

The city_state dataframe

    city        state
0   huntsville  alabama
1   montgomery  alabama
2   birmingham  alabama
3   mobile      alabama
4   dothan      alabama
5   chicago     illinois
6   boise       idaho
7   des moines  iowa

and the sentence dataframe

    sentence
0   marthy was born in dothan
1   michelle reads some books at her home
2   hasan is highschool student in chicago
3   hartford of the west is the nickname of des moines

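I want to extract a new feature called city from the sentence dataframe. If a sentence contains a city name from the city_state['city'] column, the new city column should hold that name; if the sentence contains no city from that column, the value should be Null.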

The expected new dataframe would look like this:

    sentence                                        city
0   marthy was born in dothan                       dothan
1   michelle reads some books at her home           Null
2   hasan is highschool student in chicago          chicago
3   capital of dream is the motto of des moines     des moines

I have run this code, but it fails with the error shown in the note at the end:

sentence['city'] ={}

for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
    else:
        sentence['city'].append(None)

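If you have feature-engineering experience with a similar case, could you give me some advice on writing correct code that produces the expected result?

Thanks.

Note: this is the full error log: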

ValueError: Length of values does not match length of index

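3 answers:

Answer 0: (score: 1)

A quick and dirty apply-based approach; it has not been tested on large dataframes, so use it with caution. First, define a function that extracts city names: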

def ex_city(col, cities):
    output = []
    for w in cities:
        if w in col:
            output.append(w)
    return ','.join(output) if output else None

Then apply it to the sentence dataframe:

city_list = city_state.city.unique().tolist()
sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
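
One caveat with the substring test `w in col`: it also matches a city name embedded in a longer word, for example `mobile` inside `automobile`. A minimal sketch of a word-boundary variant follows. The function name `ex_city_words` and the sample sentences are made up for illustration and are not part of the answer above.

```python
import re

def ex_city_words(text, cities):
    """Return the comma-joined cities that occur in `text` as whole words, else None."""
    found = []
    for city in cities:
        # \b anchors the match at word boundaries; re.escape guards against
        # regex metacharacters in a city name
        if re.search(r"\b" + re.escape(city) + r"\b", text):
            found.append(city)
    return ",".join(found) if found else None

# hypothetical sample data, not from the question
cities = ["dothan", "mobile", "des moines"]
print(ex_city_words("she sold her automobile", cities))             # None
print(ex_city_words("he moved from mobile to des moines", cities))  # mobile,des moines
```

It can be applied the same way as `ex_city`, i.e. `sentence['sentence'].apply(lambda x: ex_city_words(x, city_list))`.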

Answer 1: (score: 0)

Something like this might work. I would try it myself, but I am on my phone.

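# use a set of city names: `in` on a Series would test the index, not the values
cities = set(city_state.city)
sentence_cities = []

for text in sentence.sentence:
    # first word that is a known city, or None; multi-word names such as
    # 'des moines' are not found by this word-by-word check
    match = next((word for word in text.split() if word in cities), None)
    sentence_cities.append(match)

sentence['city'] = sentence_cities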
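
A point worth checking in this approach: the `in` operator on a pandas Series tests the index labels, not the values, so a test like `word in city_state.city` would not find a city name. Converting the column to a set or list first avoids this. The small sketch below illustrates the behavior with a made-up Series; the variable names are for illustration only.

```python
import pandas as pd

# hypothetical sample of city names with the default integer index
cities = pd.Series(["dothan", "chicago", "des moines"])

# `in` on a Series looks at the index labels (0, 1, 2 here), not the values
print("dothan" in cities)    # False
print(0 in cities)           # True

# converting to a set (or list) gives a membership test on the values
city_set = set(cities)
print("dothan" in city_set)  # True
```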

Answer 2: (score: 0)

sdf = sentence dataframe, cdf = city_state dataframe

des moines will cause a problem with str.split because the name contains a space.

Handle that city first (or last; this needs testing):

sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'

Then the rest:

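def get_city(sentence, cities):
    # return the first word of the sentence that is a known city
    for word in sentence.split(' '):
        if word in cities:
            return word
    return None

cities = cdf['city'].tolist()
# this assigns the whole 'city' column, so the 'des moines' step above
# must run after this line or it will be overwritten
sdf['city'] = sdf['sentence'].apply(lambda x: get_city(x, cities))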
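
Multi-word names such as `des moines` can also be handled without a special case by matching whole city names with a regular expression instead of splitting on spaces. The following is a sketch under that assumption, using `Series.str.extract`; the pattern construction and the shortened city list are illustrative and not from the answer above.

```python
import re
import pandas as pd

# small stand-ins for the question's dataframes
cdf = pd.DataFrame({"city": ["dothan", "chicago", "des moines"]})
sdf = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

# longest names first, so a longer name is tried before a shorter one that is its prefix
names = sorted(cdf["city"].unique(), key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"

# expand=False returns a Series; rows without a match become NaN
sdf["city"] = sdf["sentence"].str.extract(pattern, expand=False)
print(sdf)
```

This keeps only the first city found in each sentence; `Series.str.findall` with the same pattern would be one way to collect several.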