Question

I have a file with a text column that I have loaded. I want to check for occurrences of country names in the loaded text. I have also loaded the Wikipedia countries CSV file, and I am using the following code to count how many times each country name occurs in the loaded text.

My code doesn't work.

Here is my code:

text = pd.read_sql(select_string, con)
text['tokenized_text'] = mail_text.apply(lambda col:nltk.word_tokenize(col['SomeText']), axis=1)
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
ccs = set(country_codes['English short name lower case'])
count_occurrences=Counter(country for country in text['tokenized_text']if country in ccs)

Answer 1

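In your original code, the line

dic[country]= dic[country]+1

should raise a KeyError, because the first time a country is encountered its key is not yet present in the dictionary. Instead, you should check whether the key exists and, if it does not, initialize its value to 1.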
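A minimal sketch of that failure and of the manual fix, assuming a plain dict and a made-up token list (the data is hypothetical and not taken from the question's database):

```python
# Hypothetical tokens, for illustration only.
tokens = ["Finland", "Japan", "Finland"]

# Original pattern: reading dic[country] before the key exists raises KeyError.
dic = {}
missing_key = None
try:
    for country in tokens:
        dic[country] = dic[country] + 1
except KeyError as exc:
    missing_key = exc.args[0]  # the first unseen key

# Manual fix: initialize the value to 1 on first sight, increment afterwards.
counts = {}
for country in tokens:
    if country in counts:
        counts[country] += 1
    else:
        counts[country] = 1

print(missing_key)
print(counts)
```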

On the other hand, it won't, because the check

if country in country_codes['English short name lower case']:

yields False for all values: a Series object's __contains__ works with indices instead of values. You should check

if country in country_codes['English short name lower case'].values:

if your list of values is short.
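The difference can be seen on a tiny Series. This is only a sketch: the two-element `names` Series is a hypothetical stand-in for the country-name column, with the default integer index.

```python
import pandas as pd

# Hypothetical stand-in for country_codes['English short name lower case'].
names = pd.Series(["Finland", "Japan"])  # index labels are 0 and 1

in_series = "Finland" in names          # tests the index labels, not the values
label_in_series = 0 in names            # 0 is an index label
in_values = "Finland" in names.values   # tests the underlying values
in_set = "Finland" in set(names)        # tests the values via a set

print(in_series, label_in_series, in_values, in_set)
```

For longer lists, building a set once and testing against it avoids scanning the whole array on every lookup.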

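For general counting tasks, Python provides collections.Counter, which behaves somewhat like defaultdict(int) but with additional benefits. It removes the need to check for keys manually.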
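A short side-by-side sketch of the two, assuming a hypothetical token list and a hypothetical set of known names:

```python
from collections import Counter, defaultdict

# Hypothetical data, for illustration only.
tokens = ["Finland", "Japan", "Finland", "Cairo"]
known = {"Finland", "Japan"}

# defaultdict(int): missing keys start at 0, so += 1 just works.
by_default = defaultdict(int)
for token in tokens:
    if token in known:
        by_default[token] += 1

# Counter: can be built straight from an iterable of items to count.
by_counter = Counter(token for token in tokens if token in known)

# One of the extras: ranking by frequency.
top = by_counter.most_common(1)

# Looking up an unseen key returns 0 and does not add the key.
unseen = by_counter["Norway"]

print(dict(by_default), by_counter, top, unseen)
```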

Since you already have DataFrame objects, you can use the tools pandas provides:

In [12]: country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')

In [13]: text = pd.DataFrame({'SomeText': """Finland , Finland , Finland
    ...: The country where I want to be
    ...: Pony trekking or camping or just watch T.V.
    ...: Finland , Finland , Finland
    ...: It's the country for me
    ...: 
    ...: You're so near to Russia
    ...: so far away from Japan
    ...: Quite a long way from Cairo
    ...: lots of miles from Vietnam
    ...: 
    ...: Finland , Finland , Finland
    ...: The country where I want to be
    ...: Eating breakfast or dinner
    ...: or snack lunch in the hall
    ...: Finland , Finland , Finland
    ...: Finland has it all
    ...: 
    ...: Read more: Monty Python - Finland Lyrics | MetroLyrics
    ...: """.split()})

In [14]: text[text['SomeText'].isin(
    ...:     country_codes['English short name lower case']
    ...: )]['SomeText'].value_counts().to_dict()
    ...:
Out[14]: {'Finland': 14, 'Japan': 1}

This finds the rows of text where the value of the SomeText column is in the English short name lower case column of country_codes, counts the unique values of SomeText, and converts the result to a dictionary. The same with descriptive intermediate variables:

In [49]: where_sometext_isin_country_codes = text['SomeText'].isin(
    ...:     country_codes['English short name lower case'])

In [50]: filtered_text = text[where_sometext_isin_country_codes]

In [51]: value_counts = filtered_text['SomeText'].value_counts()

In [52]: value_counts.to_dict()
Out[52]: {'Finland': 14, 'Japan': 1}

Or simply, with Counter:

In [23]: from collections import Counter

In [24]: dic = Counter()
    ...: ccs = set(country_codes['English short name lower case'])
    ...: for country in text['SomeText']:
    ...:     if country in ccs:
    ...:         dic[country] += 1
    ...: 

In [25]: dic
Out[25]: Counter({'Finland': 14, 'Japan': 1})
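One further point about the question's own code: its tokenized_text column appears to hold one list of tokens per row, so iterating over it yields lists rather than words, and testing a list for membership in a set raises a TypeError because lists are unhashable. A hedged sketch of flattening the token lists before counting, with `tokenized_rows` and `ccs` as hypothetical stand-ins for the question's column and country-name set:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for text['tokenized_text']: one token list per row.
tokenized_rows = [
    ["Finland", ",", "Finland", "has", "it", "all"],
    ["so", "far", "away", "from", "Japan"],
]
# Hypothetical stand-in for the set of country names.
ccs = {"Finland", "Japan", "Norway"}

# Flatten the per-row lists into one stream of tokens, then count matches.
all_tokens = chain.from_iterable(tokenized_rows)
count_occurrences = Counter(token for token in all_tokens if token in ccs)

print(count_occurrences)
```

Note that token-by-token matching like this only finds single-word names; a multi-word country name would be split into separate tokens by the tokenizer.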
