Problem with a character column in a file read in with read.csv in R

Time: 2018-04-07 20:17:12

Tags: r read.csv

On the website:

http://naturalstattrick.com/teamtable.php?season=20172018&stype=2&sit=pp&score=all&rate=n&vs=all&loc=B&gpf=82&fd=2017-10-04&td=2018-04-07

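At the bottom of the page there is an option to download a csv. I downloaded the csv file and renamed it to Team Season Totals - Natural Stat Trick 2007-2008 5 vs 5 (Counts).csv. I also put the csv file in my working directory.

I read the file in successfully with read.csv.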

teams <- read.csv(file = "Team Season Totals - Natural Stat Trick 2007-2008 5 vs 5 (Counts).csv", stringsAsFactors = FALSE)

head(teams)
  ï..                 Team GP      TOI  W  L OTL ROW   CF   CA   CF.   FF   FA   FF.   SF   SA   SF.  GF  GA   GF.  SCF  SCA  SCF. SCGF SCGA SCGF. SCSH.
1   1   Atlanta Thrashers 82 3539.050 34 40   8  25 2638 3512 42.89 2002 2717 42.42 1505 2052 42.31 125 172 42.09 1195 1500 44.34   83  126 39.71  6.95
2   2 Pittsburgh Penguins 82 3435.417 47 27   8  40 2820 3380 45.48 2192 2542 46.30 1580 1812 46.58 142 122 53.79 1343 1374 49.43  112   90 55.45  8.34
3   3   Los Angeles Kings 82 3502.333 32 43   7  27 3008 3576 45.69 2306 2787 45.28 1649 1961 45.68 137 174 44.05 1049 1286 44.93   63   80 44.06  6.01
4   4  Montreal Canadiens 82 3475.183 47 25  10  42 3089 3601 46.17 2266 2603 46.54 1617 1863 46.47 144 138 51.06 1156 1221 48.63   62   61 50.41  5.36
5   5     Edmonton Oilers 82 3442.633 41 35   6  26 2958 3424 46.35 2255 2585 46.59 1601 1830 46.66 143 166 46.28 1334 1398 48.83  104  116 47.27  7.80
6   6 Philadelphia Flyers 82 3374.800 42 29  11  39 2902 3343 46.47 2188 2505 46.62 1609 1857 46.42 125 137 47.71  919 1028 47.20   61   68 47.29  6.64
  SCSV. HDCF HDCA HDCF. HDGF HDGA HDGF. HDSH. HDSV.  SH.   SV.   PDO
1 91.60  388  468 45.33   51   82 38.35 13.14 82.48 8.31 91.62 0.999
2 93.45  503  444 53.12   79   49 61.72 15.71 88.96 8.99 93.27 1.023
3 93.78  270  356 43.13   29   36 44.62 10.74 89.89 8.31 91.13 0.994
4 95.00  271  322 45.70   25   31 44.64  9.23 90.37 8.91 92.59 1.015
5 91.70  443  452 49.50   57   61 48.31 12.87 86.50 8.93 90.93 0.999
6 93.39  257  266 49.14   24   24 50.00  9.34 90.98 7.77 92.62 1.004

One thing I noticed is that the Team column has an accent:

teams$Team

[1] "Atlanta Thrashers"     "Pittsburgh Penguins"   "Los Angeles Kings"     "Montreal Canadiens"    "Edmonton Oilers"       "Philadelphia Flyers"  
 [7] "St Louis Blues"        "Colorado Avalanche"    "Vancouver Canucks"     "Minnesota Wild"        "Florida Panthers"      "Phoenix Coyotes"      
[13] "Tampa Bay Lightning"   "Buffalo Sabres"        "Chicago Blackhawks"    "New York Islanders"    "Nashville Predators"   "Anaheim Ducks"        
[19] "Boston Bruins"         "Ottawa Senators"       "Dallas Stars"          "Toronto Maple Leafs"   "Carolina Hurricanes"   "Columbus Blue Jackets"
[25] "New Jersey Devils"     "Calgary Flames"        "San Jose Sharks"       "New York Rangers"      "Washington Capitals"   "Detroit Red Wings"

Removing the accent:

teams$Team <- sub(pattern = "Â", replacement = "", teams$Team)
teams$Team[1]
[1] "Atlanta Thrashers"

Now, when I try to subset the data based on Team, every comparison returns FALSE:

teams$Team[1]
[1] "Atlanta Thrashers"
teams$Team[1] == "Atlanta Thrashers"
[1] FALSE

dplyr::filter(teams, Team == "Atlanta Thrashers")

 [1] ï..   Team  GP    TOI   W     L     OTL   ROW   CF    CA    CF.   FF    FA    FF.   SF    SA    SF.   GF    GA    GF.   SCF   SCA   SCF.  SCGF  SCGA 
[26] SCGF. SCSH. SCSV. HDCF  HDCA  HDCF. HDGF  HDGA  HDGF. HDSH. HDSV. SH.   SV.   PDO  
<0 rows> (or 0-length row.names)

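It is FALSE for every team, and I don't understand why. Is it because I removed the accents? Does it have something to do with the encoding, i.e. UTF-8? If anyone can help, I would appreciate it. Thanks.

1 answer:

Answer 0 (score: 0)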

I figured it out. I had to deal with the accents. I used:

iconv(teams$Team, "UTF-8", "UTF-8", sub = ' ')

iconv(teams$Team, "UTF-8", "UTF-8",sub=' ')[1] == "Atlanta Thrashers"

[1] TRUE
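A likely explanation for why this works, offered as a sketch rather than a confirmed diagnosis: the source never shows the underlying bytes, so the example below assumes that removing "Â" left a lone 0xa0 byte (the Latin-1 non-breaking space) where an ordinary space was expected. The string x is a hypothetical stand-in for one value of teams$Team, not data from the actual file.

```r
# Hypothetical stand-in for one value of teams$Team after "Â" was removed:
# byte 0xa0 sits where an ordinary space (0x20) would normally be.
x <- rawToChar(c(charToRaw("Atlanta"), as.raw(0xa0), charToRaw("Thrashers")))

# Printed output can look the same, so compare the raw bytes instead.
typed <- charToRaw("Atlanta Thrashers")
charToRaw(x)
typed
mismatch <- which(charToRaw(x) != typed)  # index of the differing byte
mismatch

# A lone 0xa0 is not valid UTF-8 ...
validUTF8(x)

# ... so iconv() treats it as non-convertible and replaces it with `sub`.
fixed <- iconv(x, from = "UTF-8", to = "UTF-8", sub = " ")
fixed == "Atlanta Thrashers"
```

Running charToRaw() on a real value such as teams$Team[1] would show whether the same byte is present in the actual data.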

I had never run into this before and have no experience with encodings or UTF-8.
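An alternative that may avoid the problem at the source is to declare the encoding when reading the file, then replace the non-breaking space explicitly. The sketch below is built on assumptions: it uses a small temporary file as a stand-in, and it supposes that the real download is UTF-8 with U+00A0 between the words of each team name, which the question does not confirm. The names tmp and demo are illustrative only.

```r
# Write a tiny stand-in csv whose team name contains U+00A0
# (bytes c2 a0 in UTF-8) instead of an ordinary space.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("Team,GP\nAtlanta"),
    as.raw(c(0xc2, 0xa0)),
    charToRaw("Thrashers,82\n")),
  tmp
)

# encoding = "UTF-8" marks the strings as UTF-8 instead of letting them
# be interpreted in the native encoding of the session.
demo <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

# Replace every non-breaking space with a plain one (gsub, not sub,
# so names with more than one gap are handled too).
demo$Team <- gsub("\u00a0", " ", demo$Team, fixed = TRUE)

demo$Team == "Atlanta Thrashers"
unlink(tmp)
```

read.csv also accepts fileEncoding = "UTF-8-BOM", which re-encodes the input and drops a leading byte order mark. That may be what lies behind the ï.. column name in the head(teams) output, though this has not been checked against the actual file.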