Pandas不断将字符串转换为int

时间:2018-01-18 16:09:55

标签: python string pandas

我从这个问题Df groupby set comparison得到以下代码:

   import pandas as pd

wordlist = pd.read_csv('data/example.txt', sep='\r', header=None, index_col=None, names=['word'])
wordlist = wordlist.drop_duplicates(keep='first')
# wordlist['word'] = wordlist['word'].astype(str)
wordlist['split'] = ''
wordlist['anagrams'] = ''

for index, row in wordlist.iterrows() :
    row['split'] = list(row['word'])

    anaglist = wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
    wordlist['anagrams'] = anaglist

wordlist = wordlist.drop(['split'], axis=1)

wordlist = wordlist['anagrams'].drop_duplicates(keep='first')

print(wordlist)
print(wordlist.dtypes)

我的example.txt文件中的某些输入似乎被读作int,特别是如果字符串具有不同的字符长度。我无法强迫pandas使用.astype(str)将数据视为字符串

发生了什么?

1 个答案:

答案 0 :(得分:1)

首先强制读取列到字符串可以使用dtype=str中的参数read_csv,但如果必须显式转换数字列,则使用它。所以看起来因为字符串值列中的所有值都被隐式转换为str

我尝试了一下你的代码:

<强>设置

import pandas as pd
import numpy as np

temp=u'''"acb"
"acb"
"bca"
"foo"
"oof"
"spaniel"'''
#after testing replace 'pd.compat.StringIO(temp)' to 'example.txt'
wordlist = pd.read_csv(pd.compat.StringIO(temp), sep="\r", index_col=None, names=['word'])
print (wordlist)
      word
0      acb
1      acb
2      bca
3      foo
4      oof
5  spaniel
#first remove duplicates
wordlist = wordlist.drop_duplicates()
#create lists and join them
wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))

print (wordlist)
      word anagrams
0      acb      abc
2      bca      abc
3      foo      foo
4      oof      foo
5  spaniel  aeilnps

#sort DataFrame by column anagrams
wordlist = wordlist.sort_values('anagrams')
#get first duplicated rows
wordlist1 = wordlist[wordlist['anagrams'].duplicated()]
print (wordlist1)
  word anagrams
2  bca      abc
4  oof      foo

#get all duplicated rows
wordlist2 = wordlist[wordlist['anagrams'].duplicated(keep=False)]
print (wordlist2)
  word anagrams
0  acb      abc
2  bca      abc
3  foo      foo
4  oof      foo