Question

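How can I remove noise from a word (or a sequence of words)? By noise I mean: 's, 're, ., ?, ,, ; and so on. In other words, punctuation and contractions. It should only be removed from the left and right edges; noise inside the string should be kept.

Examples: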

Apple.            Apple
Donald Trump's    Trump
They're           They
I'm               I
¿Hablas espanol?  Hablas espanol
$12               12
H4ck3r            H4ck3r
What's up         What's up

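So basically: remove apostrophes, verb contractions, and punctuation, but only at the edges of the string (right/left). It seems that strip does not work with full matches, and I could not find a suitable re method that applies only to the edges.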

Answer 1

How about this?

import re

strings = ['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"]

rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in strings for m in [rx.search(string)] if m]
print(filtered)

This yields

['Apple', 'Trump', 'They', 'I', 'Hablas', '12', 'H4ck3r']

Instead of eating characters away from the left or right, it simply takes the first match of word characters (roughly [a-zA-Z0-9_]).
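One detail worth checking before relying on this: in Python 3, `\w` in a `str` pattern also matches non-ASCII letters unless `re.ASCII` is passed. The sketch below illustrates the difference; the helper name `first_word` and the sample word are made up for this illustration.

```python
import re

WORD = re.compile(r'\w+')                  # Unicode-aware (default for str patterns)
ASCII_WORD = re.compile(r'\w+', re.ASCII)  # restricts \w to [a-zA-Z0-9_]

def first_word(text, ascii_only=False):
    """Return the first run of word characters in text, or None if there is none."""
    rx = ASCII_WORD if ascii_only else WORD
    m = rx.search(text)
    return m.group(0) if m else None

print(first_word("¿Español?"))                   # Español
print(first_word("¿Español?", ascii_only=True))  # Espa
```

So the default behaviour is usually what you want for accented input; `re.ASCII` would cut the word at the first non-ASCII letter.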

To apply it "in the wild", you could split the sentence first, like so:

sentence = "Apple. Trump's They're I'm ¿Hablas $12 H4ck3r"

rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in sentence.split() for m in [rx.search(string)] if m]
print(filtered)

This obviously yields the same list as above.
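Note that searching for the first `\w+` run returns only the first word, so a multi-word input such as "What's up" would come back as "What". If the inner content has to survive, one possible alternative is to strip only at the edges. This is a rough sketch, not a definitive solution: the function name `strip_edges` and the list of contraction suffixes are assumptions and would need extending for real text.

```python
import re

# Matches noise at the edges only:
#   ^\W+                      leading non-word characters
#   (?:['’](?:...))?\W*$      an optional trailing contraction, then trailing punctuation
# The suffix list (s, re, m, ve, ll, d) is an assumption; extend it as needed.
EDGE_NOISE = re.compile(
    r"^\W+|(?:['’](?:s|re|m|ve|ll|d))?\W*$",
    re.IGNORECASE,
)

def strip_edges(text):
    """Remove punctuation and a trailing contraction from the edges of text."""
    return EDGE_NOISE.sub("", text)

samples = ["Apple.", "They're", "I'm", "¿Hablas espanol?", "$12", "H4ck3r", "What's up"]
print([strip_edges(s) for s in samples])
# ['Apple', 'They', 'I', 'Hablas espanol', '12', 'H4ck3r', "What's up"]
```

With this sketch "Donald Trump's" becomes "Donald Trump" rather than "Trump", since nothing in the pattern drops a leading word.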

Answer 2

Using pandas:

import pandas as pd
s = pd.Series(['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"])

# str.extract returns a DataFrame by default; column 0 holds the first capture group
s.str.extract(r'(\w+)')[0]

Output:

0     Apple
1     Trump
2      They
3         I
4    Hablas
5        12
6    H4ck3r
Name: 0, dtype: object

Remove verb contractions and punctuation from string edges using Python
