Question

我正在尝试从列中的所有行中删除标点符号。所有这些行均包含字符串数据。我尝试了几个正则表达式，但没有用。谁能告诉我这种语法在哪里？

for i in range(0, 3847):
    #Remove punctuation
    text = re.sub(r'[^\w\s]','',dataset['abstract1'][i])

这是我得到的错误：

       4     #Remove punctuations
 ----> 5     text = re.sub('[^\w\s]','',dataset['abstract1'][i])
       6 
       7     #Convert to lowercase

 G:\Anaconda3\lib\site-packages\pandas\core\series.py in
 __getitem__(self, key)
     866         key = com.apply_if_callable(key, self)
     867         try:
 --> 868             result = self.index.get_value(self, key)
     869 
     870             if not is_scalar(result):

 G:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)    
    4373         try:    
    4374            return self._engine.get_value(s, k,
 -> 4375                                          tz=getattr(series.dtype, 'tz', None))    
    4376         except KeyError as e1:    
    4377             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

 pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

 pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

 pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

 pandas/_libs/hashtable_class_helper.pxi in
 pandas._libs.hashtable.Int64HashTable.get_item()

 pandas/_libs/hashtable_class_helper.pxi in
 pandas._libs.hashtable.Int64HashTable.get_item()

Answer 1

如果要处理pandas.DataFrame对象，则可以避免使用for-loop。而是使用pandas.Series.str.replace删除标点符号。

# sample data
dataset = pd.DataFrame({
    'abstract1': ['so,me p#nct*!&io* issues', '!@#hfd87***}}|', 't&e%s$t@']
})

    abstract1
0   so,me p#nct*!&io* issues
1   !@#hfd87***}}|
2   t&e%s$t@

dataset['punct_removed'] = dataset['abstract1'].str.replace(r'[^\w\s]', '')

    abstract1                   punct_removed
0   so,me p#nct*!&io* issues    some pnctio issues
1   !@#hfd87***}}|              hfd87
2   t&e%s$t@                    test

使用循环从字符串中删除标点符号时出现KeyError

1 个答案: