How to identify occurrences of items from one list in another list

Time: 2016-09-20 08:37:24

Tags: python python-3.x

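I have a file containing a text column, which I have loaded. I want to check where country names occur in the loaded text. I have also loaded the Wikipedia country codes CSV file, and I am using the following code to count the occurrences of country names in the loaded text.

My code does not work.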

Here is my code:

text = pd.read_sql(select_string, con)
text['tokenized_text'] = mail_text.apply(lambda col: nltk.word_tokenize(col['SomeText']), axis=1)
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
ccs = set(country_codes['English short name lower case'])
count_occurrences = Counter(country for country in text['tokenized_text'] if country in ccs)

1 answer:

Answer 0: (score: 1)

In your original code,

dic[country]= dic[country]+1

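should raise a KeyError, because the first time a country is encountered its key is not yet present in the dictionary. Instead, you should check whether the key exists and, if it does not, initialize its value to 1.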
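A minimal sketch of that failure and two common ways around it. The token list and the names `counts_checked` and `counts_get` are hypothetical and only serve the illustration:

```python
tokens = ["Finland", "Japan", "Finland"]  # hypothetical sample data

# Incrementing a key that does not exist yet raises KeyError.
dic = {}
try:
    dic["Finland"] = dic["Finland"] + 1
except KeyError as exc:
    print("KeyError:", exc)

# Fix 1: check for the key and initialize it to 1 the first time it is seen.
counts_checked = {}
for country in tokens:
    if country in counts_checked:
        counts_checked[country] += 1
    else:
        counts_checked[country] = 1

# Fix 2: dict.get supplies 0 when the key is missing.
counts_get = {}
for country in tokens:
    counts_get[country] = counts_get.get(country, 0) + 1

print(counts_checked)
print(counts_get)
```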

In practice, however, it does not, because the check

if country in country_codes['English short name lower case']:

produces False for all values: the __contains__ method of Series objects works with indices instead of values. You should instead check

if country in country_codes['English short name lower case'].values:

if your list of values is short.
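The difference can be seen on a small Series. This is a sketch with a hypothetical three-name Series standing in for the CSV column; the set lookup at the end is one option when many membership checks are needed:

```python
import pandas as pd

# Hypothetical stand-in for the country-name column of the CSV.
names = pd.Series(["Finland", "Japan", "Sweden"])

in_series = "Finland" in names          # tests the index labels 0, 1, 2
label_in_series = 0 in names            # 0 is one of the index labels
in_values = "Finland" in names.values   # tests the stored values
in_set = "Finland" in set(names)        # set lookup, suited to many checks

print(in_series, label_in_series, in_values, in_set)
```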

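For general counting tasks, Python provides collections.Counter, which behaves a bit like defaultdict(int) but with additional benefits. It removes the need to check for keys manually.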
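A short comparison of the two, using a hypothetical token list. Both avoid the manual key check; Counter can also count an iterable directly and report the most frequent items:

```python
from collections import Counter, defaultdict

tokens = ["Finland", "Japan", "Finland", "Finland"]  # hypothetical sample data

# defaultdict(int) creates missing keys with the value 0 on first access.
dd = defaultdict(int)
for country in tokens:
    dd[country] += 1

# Counter needs no key check either and can count an iterable directly.
counter = Counter(tokens)

print(dict(dd))
print(counter.most_common(1))   # the most frequent item with its count
print(counter["Norway"])        # a missing key reads as 0 and is not inserted
```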

Since you already have DataFrame objects, you can use the tools that pandas provides:

In [12]: country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')

In [13]: text = pd.DataFrame({'SomeText': """Finland , Finland , Finland
    ...: The country where I want to be
    ...: Pony trekking or camping or just watch T.V.
    ...: Finland , Finland , Finland
    ...: It's the country for me
    ...: 
    ...: You're so near to Russia
    ...: so far away from Japan
    ...: Quite a long way from Cairo
    ...: lots of miles from Vietnam
    ...: 
    ...: Finland , Finland , Finland
    ...: The country where I want to be
    ...: Eating breakfast or dinner
    ...: or snack lunch in the hall
    ...: Finland , Finland , Finland
    ...: Finland has it all
    ...: 
    ...: Read more: Monty Python - Finland Lyrics | MetroLyrics
    ...: """.split()})

In [14]: text[text['SomeText'].isin(
    ...:     country_codes['English short name lower case']
    ...: )]['SomeText'].value_counts().to_dict()
    ...:
Out[14]: {'Finland': 14, 'Japan': 1}

This finds the rows of text where the value of the SomeText column is in the English short name lower case column of country_codes, counts the unique values of SomeText, and converts the result to a dictionary. The same with descriptive intermediate variables:

In [49]: where_sometext_isin_country_codes = text['SomeText'].isin(
    ...:     country_codes['English short name lower case'])

In [50]: filtered_text = text[where_sometext_isin_country_codes]

In [51]: value_counts = filtered_text['SomeText'].value_counts()

In [52]: value_counts.to_dict()
Out[52]: {'Finland': 14, 'Japan': 1}

The same with Counter:

In [23]: from collections import Counter

In [24]: dic = Counter()
    ...: ccs = set(country_codes['English short name lower case'])
    ...: for country in text['SomeText']:
    ...:     if country in ccs:
    ...:         dic[country] += 1
    ...: 

In [25]: dic
Out[25]: Counter({'Finland': 14, 'Japan': 1})
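If each row holds a list of tokens rather than a single word, as nltk.word_tokenize would produce for the tokenized_text column in the question, the lists need to be flattened before the membership test, since a list cannot be looked up in a set. A minimal sketch, assuming hypothetical sample rows and a hand-written set of names in place of the CSV column:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-ins: one token list per row, plus a set of country names.
tokenized_rows = [
    ["Finland", ",", "Finland", "has", "it", "all"],
    ["so", "far", "away", "from", "Japan"],
]
ccs = {"Finland", "Japan", "Sweden"}

# Flatten the per-row lists into one stream of tokens, then count the matches.
all_tokens = chain.from_iterable(tokenized_rows)
counts = Counter(token for token in all_tokens if token in ccs)

print(counts)
```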