Remove \xef\xbb\xbf from a list read from a file

Time: 2015-12-16 06:02:14

Tags: python python-2.7 stop-words

I am trying to read a large .txt data file and strip out all the commas, periods, and so on, so I read the file with this Python code:

import re
from nltk.corpus import stopwords  # assumed: stopwords is NLTK's stop-word corpus

file = open("file.txt", "r")
importantWords = []
for i in file.readlines():
    line = i[:-1].split(" ")
    for word in line:
        for j in word:
            word = re.sub('[\!@#$%^&*-/,.;:]', '', word)
        word.lower()
        if word not in stopwords.words('spanish'):
            importantWords.append(word)
print importantWords

The printed list looks like ['\xef\xbb\xbfdataText1', 'dataText2' .. 'dataTextn']. How can I get rid of the \xef\xbb\xbf? I am using Python 2.7.

1 answer:

Answer 0: (score: 4)

It is the UTF-8 encoded BOM (byte order mark):

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
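
If the file has to be read as raw bytes, the BOM can also be checked for and removed by hand. The following is only a minimal sketch of that idea; the helper name strip_utf8_bom and the sample data are made up for illustration and are not part of the original answer.

```python
import codecs


def strip_utf8_bom(data):
    """Return the byte string `data` without a leading UTF-8 BOM, if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: the bytes of a file that was saved with a BOM.
raw = codecs.BOM_UTF8 + b"dataText1 dataText2"
words = strip_utf8_bom(raw).split(b" ")
```

The BOM only ever appears at the very start of the file, which is why only the first word in the list is affected.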

You can use codecs.open with encoding='utf-8-sig' to skip the BOM sequence:

with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...

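Side note: instead of calling file.readlines, iterate over the file directly. If all you want is to loop over the lines, file.readlines creates an unnecessary temporary list.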
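
Putting both suggestions together, the loop from the question might be restructured roughly as follows. This is a sketch under several assumptions: important_words, STOP_WORDS and PUNCTUATION are hypothetical names, the small stop-word set stands in for stopwords.words('spanish'), and io.open is used in place of codecs.open because it accepts the same encoding argument on Python 2.7 and 3.

```python
import io
import re

# Hypothetical stand-in for stopwords.words('spanish'); a set keeps lookups fast.
STOP_WORDS = {"de", "la", "el", "y"}

# Punctuation to drop. The hyphen is escaped so it is matched literally
# instead of forming a character range as in the question's pattern.
PUNCTUATION = re.compile(r"[!@#$%^&*\-/,.;:]")


def important_words(path, stop_words=STOP_WORDS):
    """Collect lowercased, punctuation-free words that are not stop words."""
    words = []
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:  # iterate lazily, no readlines() list
            for word in line.split():
                word = PUNCTUATION.sub("", word).lower()
                if word and word not in stop_words:
                    words.append(word)
    return words
```

Unlike the code in the question, this version keeps the result of lower(), since strings are immutable and word.lower() on its own line has no effect.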