Matching content to create a new column

Time: 2019-10-23 00:34:28

Tags: python pandas function

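Hello, I have a dataset in which I want to match keywords against locations. The problem is that the locations "Afghanistan", "Kabul", or "Helmand" appear in my dataset in more than 150 combinations, including misspellings, different capitalization, and the names of cities or towns. I want to create a separate column that returns 1 if the location contains any of the character sequences "afg", "Afg", "kab", or "helm". I am not sure whether upper or lower case makes a difference.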

For example, there are hundreds of location combinations, such as: Jegdalak, Afghanistan, Afghanistan, Ghazni♥, Kabul/Afghanistan.

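I have tried the code below. It works when a word in the location matches a keyword exactly, but there are too many variations to write down every exception.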

keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']


#how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)

# the line below returns the rows flagged with 1
# (the column holds integers, so compare against 1, not the string '1')

taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)

I just need to replace this logic so that the "keyword_solution" column flags every row that matches "Afg", "afg", "kab", "Kab", "kund", or "Kund".

1 answer:

Answer 0: (score: 0)

Given the following:

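  • Sentences from The New York Times
  • Remove all non-alphanumeric characters
  • Change everything to lowercase, so the keywords do not need separate capitalized variants
  • Split each sentence into a list or a set. Because the sentences are long, I used a set
  • Add to the keywords list as needed
  • Match the words in the two lists
    • 'afgh' in ['afghanistan'] is False
    • 'afgh' in 'afghanistan' is True
    • So the list comprehension searches for each keyword inside each word of word_list.
    • [True if word in y else False for y in x for word in keywords]
    • This lets the keywords list be shorter (e.g. given afgh, afghanistan is not needed)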
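The difference between list membership and substring containment described in the bullets above can be checked in isolation. This is only a minimal sketch: the `words` set and the two-item `keywords` list are made-up illustration values, not data from the question.

```python
# Illustrative values only (assumptions, not taken from the dataset).
keywords = ['afgh', 'kab']
words = {'afghanistan', 'kabul', 'ghazni'}

# List membership compares whole elements, so a prefix does not match.
in_list = 'afgh' in ['afghanistan']

# String containment looks for a substring, so the prefix matches.
in_string = 'afgh' in 'afghanistan'

# Test every keyword against every word, as the nested comprehension does.
# sorted() is used only to make the order of the result predictable.
hits = [(kw, w) for w in sorted(words) for kw in keywords if kw in w]

print(in_list, in_string)
print(hits)
```

Because the test is a substring test, a very short keyword such as 'kab' will also match any unrelated word that happens to contain those letters, so the keywords should be kept as specific as the data allows.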
import re
import pandas as pd

# short, lowercase keyword fragments; each entry needs to appear only once
keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']

df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
                                 'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
                                 'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
                                 'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
                                 '“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
                                 'afghan']})

# replace runs of non-alphanumeric characters with a single space
# (raw string, so that \W is not treated as an invalid string escape)
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))

# create a new column with the set of all lowercase words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))

# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))

# final
print(df.location)

0     True
1    False
2    False
3     True
4     True
5     True
Name: location, dtype: bool
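Since the question asks for a 1/0 column and for matching regardless of case, a vectorized variant may also be worth trying. The sketch below is an alternative to the approach above, not part of the original answer. It uses `Series.str.contains` with `case=False`; the `locations` frame and its sample values are hypothetical, loosely modelled on the examples in the question, and the keyword fragments are the ones the question lists.

```python
import re
import pandas as pd

# Hypothetical sample data (assumption), including a misspelling and a missing value.
locations = pd.DataFrame({'location': ['Jegdalak, Afghanistan',
                                       'afganistan',
                                       'Ghazni ♥',
                                       'KABUL / Afg',
                                       'London',
                                       None]})

# Keyword fragments mentioned in the question.
keywords = ['afg', 'kab', 'helm', 'kund']

# Escape each fragment and join them into one alternation pattern.
pattern = '|'.join(re.escape(k) for k in keywords)

# case=False ignores capitalization; na=False counts missing values as no match.
matches = locations['location'].str.contains(pattern, case=False, na=False)

# Convert the boolean result to the 1/0 flag the question asks for.
locations['keyword_solution'] = matches.astype(int)

print(locations)
```

This skips the cleaning and splitting steps, because the pattern is searched anywhere in the raw string. The same caveat about short fragments applies: 'kab' or 'afg' will flag any location that contains those letters.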