How to keep repeated punctuation while parsing a Python string?

Time: 2016-02-12 19:18:57

Tags: regex python-2.7 parsing text punctuation

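I need to process a small amount of text (i.e. strings in Python).

I would like to remove certain punctuation (such as '.', ',', ':', ';') but keep punctuation that signals emotion (such as '...', '?', '??', '???', '!', '!!', '!!!').

I would also like to remove uninformative words: 'a', 'an', 'the'. The biggest challenge so far is how to parse "I've" or "we've" so that I end up with "I have" and "we have". The apostrophe is giving me trouble.

What is the best/easiest way to do this in Python?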

For example:

"I've got an A mark!!! Such a relief... I should've partied more."

The result I would like to get:

['I', 'have', 'got', 'A', 'mark', '!!!', 'Such', 'relief', '...',
 'I', 'should', 'have', 'partied', 'more']

1 answer:

Answer 0: (score: 0)

This can become complicated, depending on how many more rules you want to apply.

You could use \b in your regular expressions to match the beginning or end of a word. With this you can also isolate punctuation and check whether a token is a single character from a class like [.;:].
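To illustrate the word-boundary idea, here is a minimal sketch. The sample sentence and the variable names are made up for illustration, and the comma in the class [.,;:] is an assumption taken from the question's list of punctuation to drop, not from the answer.

```python
import re

# Hypothetical sample text, chosen only to show the effect of \b.
text = "Wait, what?? Fine... ok."

# Insert a space at every word boundary; splitting on whitespace then
# yields words and punctuation runs as separate tokens.
tokens = re.sub(r'\b', ' ', text).split()
print(tokens)
# ['Wait', ',', 'what', '??', 'Fine', '...', 'ok', '.']

# Drop tokens that are exactly one character from [.,;:].
# Runs such as '??' or '...' are longer than one character, so they stay.
kept = [t for t in tokens if not re.match(r'^[.,;:]$', t)]
print(kept)
# ['Wait', 'what', '??', 'Fine', '...', 'ok']
```

Because the check is anchored with ^ and $, a lone '.' is removed while '...' survives, which is the distinction the question asks for.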

These ideas are used in this code:

[.;:]

Output:

I,have,got,A,mark,!!!,Such,relief,...,I,should,have,partied,more

See it run on eval.in
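For reference, a minimal sketch of how the ideas above could be combined into one function. This is an illustration under stated assumptions, not necessarily the answer's own listing: the name tokenise, the handling of only the 've contraction, the comma in the punctuation class, and the rule that a capital 'A' is kept (so the grade "A" in the example survives) are all choices made here to match the question's expected result.

```python
import re

def tokenise(text):
    # Expand the "'ve" contraction: "I've" -> "I have", "should've" -> "should have".
    text = re.sub(r"(\w)'ve\b", r"\1 have", text)
    # Put a space at every word boundary to separate punctuation from words.
    tokens = re.sub(r"\b", " ", text).split()
    result = []
    for tok in tokens:
        # Skip single-character punctuation; repeated runs like '!!!' are kept.
        if re.match(r"^[.,;:]$", tok):
            continue
        # Skip articles. Assumption: lowercase 'a' is an article, while a
        # capital 'A' is kept because it may be meaningful (a grade here).
        if tok == "a" or tok.lower() in ("an", "the"):
            continue
        result.append(tok)
    return result

sample = "I've got an A mark!!! Such a relief... I should've partied more."
print(",".join(tokenise(sample)))
# I,have,got,A,mark,!!!,Such,relief,...,I,should,have,partied,more
```

One limitation of this sketch: a sentence that starts with the article "A" would keep it, and other contractions (such as n't or 're) would need their own substitution rules.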