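I have a file containing a text column, which I have loaded. I want to check for occurrences of country names in the loaded text. I have also loaded the Wikipedia countries CSV file, and I am using the following code to count how often country names occur in the loaded text.
My code does not work.
Here is my code: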
text = pd.read_sql(select_string, con)
text['tokenized_text'] = mail_text.apply(lambda col:nltk.word_tokenize(col['SomeText']), axis=1)
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
ccs = set(country_codes['English short name lower case'])
count_occurrences=Counter(country for country in text['tokenized_text']if country in ccs)
Answer 0 (score: 1)
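In your original code, the line dic[country]= dic[country]+1 should raise a KeyError, because the key is not yet present in the dictionary the first time a country is encountered. Instead, you should check whether the key exists and, if it does not, initialize its value to 1.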
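As a minimal sketch of that manual check (the word list below is hypothetical sample data, not taken from the question):

```python
# Hypothetical sample data standing in for the words of the loaded text.
words = ["Finland", "Japan", "Finland"]

counts = {}
for word in words:
    if word in counts:
        # The key already exists, so incrementing is safe.
        counts[word] += 1
    else:
        # First occurrence: the key does not exist yet, so start at 1.
        counts[word] = 1

print(counts)  # {'Finland': 2, 'Japan': 1}
```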
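On the other hand, the KeyError never actually occurs, because the check
if country in country_codes['English short name lower case']:
yields False for all values: the __contains__ method of Series objects works with indices instead of values. You should check
if country in country_codes['English short name lower case'].values: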
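The difference is easy to see on a small Series. This is only a sketch: the names variable is a hypothetical stand-in for the country-name column, with the default integer index.

```python
import pandas as pd

# Hypothetical stand-in for country_codes['English short name lower case'].
names = pd.Series(["Finland", "Japan"])

# `in` on a Series tests the index labels (here 0 and 1), not the values.
in_index = "Finland" in names
label_in_index = 0 in names

# Testing against .values compares with the stored strings.
in_values = "Finland" in names.values

print(in_index)        # False
print(label_in_index)  # True
print(in_values)       # True
```

For repeated lookups, building a set from the column once avoids scanning the whole array on every check.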
For general counting tasks, Python provides collections.Counter, which acts a bit like a defaultdict(int) but with added benefits. It removes the need to check for keys manually.
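A short comparison of the two, again on hypothetical sample data:

```python
from collections import Counter, defaultdict

# Hypothetical sample data.
words = ["Finland", "Japan", "Finland"]

# defaultdict(int) creates missing keys with the value 0 on first access.
by_defaultdict = defaultdict(int)
for word in words:
    by_defaultdict[word] += 1

# Counter can count an iterable directly.
by_counter = Counter(words)

# One of the extras: the most frequent items, sorted by count.
top = by_counter.most_common(1)
print(top)  # [('Finland', 2)]

# Looking up a missing key returns 0 and does not insert the key.
missing = by_counter["Russia"]
print(missing)  # 0
```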
Since you already have DataFrame objects, you could use the tools that pandas provides:
In [12]: country_codes = pd.read_csv('wikipedia-iso-country-codes.csv')
In [13]: text = pd.DataFrame({'SomeText': """Finland , Finland , Finland
...: The country where I want to be
...: Pony trekking or camping or just watch T.V.
...: Finland , Finland , Finland
...: It's the country for me
...:
...: You're so near to Russia
...: so far away from Japan
...: Quite a long way from Cairo
...: lots of miles from Vietnam
...:
...: Finland , Finland , Finland
...: The country where I want to be
...: Eating breakfast or dinner
...: or snack lunch in the hall
...: Finland , Finland , Finland
...: Finland has it all
...:
...: Read more: Monty Python - Finland Lyrics | MetroLyrics
...: """.split()})
In [14]: text[text['SomeText'].isin(
...: country_codes['English short name lower case']
...: )]['SomeText'].value_counts().to_dict()
...:
Out[14]: {'Finland': 14, 'Japan': 1}
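This finds the rows of text where the value of the SomeText column is in the English short name lower case column of country_codes, counts the unique values of SomeText, and converts the result to a dictionary. The same with descriptive intermediate variables:
In [49]: where_sometext_isin_country_codes = text['SomeText'].isin(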
...: country_codes['English short name lower case'])
In [50]: filtered_text = text[where_sometext_isin_country_codes]
In [51]: value_counts = filtered_text['SomeText'].value_counts()
In [52]: value_counts.to_dict()
Out[52]: {'Finland': 14, 'Japan': 1}
The same with Counter:
In [23]: from collections import Counter
In [24]: dic = Counter()
...: ccs = set(country_codes['English short name lower case'])
...: for country in text['SomeText']:
...:     if country in ccs:
...:         dic[country] += 1
...:
In [25]: dic
Out[25]: Counter({'Finland': 14, 'Japan': 1})
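The loop can also be collapsed into a generator expression passed straight to Counter. The sketch below uses hypothetical stand-ins for ccs and the text data. It also assumes that, if each row holds a list of tokens (as the tokenized_text column in the question appears to), the lists need to be flattened before the membership test, because a list is not hashable and cannot be looked up in a set.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-ins for the country-name set and the text data.
ccs = {"Finland", "Japan"}
tokens = ["Finland", "is", "far", "from", "Japan", "Finland"]

# One token per element: filter and count in a single expression.
counts = Counter(token for token in tokens if token in ccs)
print(counts)  # Counter({'Finland': 2, 'Japan': 1})

# One list of tokens per row: flatten the lists first, then filter and count.
token_lists = [["Finland", "is", "cold"], ["Japan", "and", "Finland"]]
flat_counts = Counter(
    token for token in chain.from_iterable(token_lists) if token in ccs
)
print(flat_counts)  # Counter({'Finland': 2, 'Japan': 1})
```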