I got the following code from the question "Df groupby set comparison":
import pandas as pd
wordlist = pd.read_csv('data/example.txt', sep='\r', header=None, index_col=None, names=['word'])
wordlist = wordlist.drop_duplicates(keep='first')
# wordlist['word'] = wordlist['word'].astype(str)
wordlist['split'] = ''
wordlist['anagrams'] = ''
for index, row in wordlist.iterrows():
    row['split'] = list(row['word'])
anaglist = wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
wordlist['anagrams'] = anaglist
wordlist = wordlist.drop(['split'], axis=1)
wordlist = wordlist['anagrams'].drop_duplicates(keep='first')
print(wordlist)
print(wordlist.dtypes)
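Some of the entries in my example.txt file seem to be read as int, especially when the strings have different character lengths. I can't force pandas to treat the data as strings using .astype(str).
What is going on?
Answer 0 (score: 1)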
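First, to force a column to be read as strings you can pass the parameter dtype=str to read_csv, but that is only needed when numeric columns have to be converted explicitly. Here there does not seem to be a problem: because the column contains string values, all of its values are converted to str implicitly.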
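As a minimal sketch of the dtype=str point, the snippet below reads the same text twice, once with default type inference and once with dtype=str. The sample values are made up for illustration and are not taken from example.txt.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: one value per line, every value looks numeric.
raw = "123\n007\n42\n"

# Default inference: an all-numeric column is parsed as integers,
# so the leading zeros of "007" are lost.
inferred = pd.read_csv(StringIO(raw), header=None, names=["word"])

# dtype=str keeps every value exactly as it was written in the file.
as_text = pd.read_csv(StringIO(raw), header=None, names=["word"], dtype=str)

print(inferred["word"].tolist())
print(as_text["word"].tolist())
```

If the column already contains at least one non-numeric word, pandas keeps the whole column as strings without dtype=str, which matches the behaviour described above.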
I tried reworking your code a bit:
Setup:
import pandas as pd
import numpy as np
from io import StringIO  # pd.compat.StringIO is not available in recent pandas versions
temp=u'''"acb"
"acb"
"bca"
"foo"
"oof"
"spaniel"'''
#after testing replace 'StringIO(temp)' with 'example.txt'
wordlist = pd.read_csv(StringIO(temp), sep="\r", index_col=None, names=['word'])
print (wordlist)
      word
0      acb
1      acb
2      bca
3      foo
4      oof
5  spaniel
#first remove duplicates
wordlist = wordlist.drop_duplicates()
#create lists and join them
wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
print (wordlist)
      word anagrams
0      acb      abc
2      bca      abc
3      foo      foo
4      oof      foo
5  spaniel  aeilnps
#sort DataFrame by column anagrams
wordlist = wordlist.sort_values('anagrams')
#get only the repeated rows (every occurrence after the first)
wordlist1 = wordlist[wordlist['anagrams'].duplicated()]
print (wordlist1)
  word anagrams
2  bca      abc
4  oof      foo
#get all duplicated rows
wordlist2 = wordlist[wordlist['anagrams'].duplicated(keep=False)]
print (wordlist2)
  word anagrams
0  acb      abc
2  bca      abc
3  foo      foo
4  oof      foo
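If the goal is to see which words belong together, one possible follow-up (not part of the original answer) is to group on the sorted-letter key instead of filtering with duplicated. The sketch below rebuilds the small frame from the setup above; the names `words` and `groups` are chosen for illustration.

```python
import pandas as pd

# Same words as in the setup above, with the duplicate "acb" already dropped.
words = pd.DataFrame({"word": ["acb", "bca", "foo", "oof", "spaniel"]})

# Sorted letters act as the anagram key, as in the answer.
words["anagrams"] = words["word"].apply(lambda w: "".join(sorted(w)))

# Collect the words that share a key into one list per key.
groups = words.groupby("anagrams")["word"].apply(list)

# Keep only keys with more than one word, i.e. real anagram groups.
groups = groups[groups.apply(len) > 1]

print(groups.to_dict())
```

This gives one row per anagram key rather than one row per word, which may be easier to inspect on a long word list.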