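Split a string on certain words and remove certain special characters in Python

Time: 2016-05-26 06:07:15

Tags: python regex split

I have a list of tuples containing strings. I want to split each string into smaller strings on specific delimiters and remove certain characters.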

item_list = [('apple OR orange AND NOT pineapple'), ('(sugar and salt) or (pepper and vinegar)'), ...]

# This is what each string inside the tuples looks like
str1 = 'apple OR orange AND NOT pineapple'
str2 = '(sugar and salt) or (pepper and vinegar)'

Desired result:

cleaned_list = [['apple', 'orange', 'pineapple'], ['sugar', 'salt', 'pepper', 'vinegar'], ...]

# This is what each list should look like after splitting
list1 = ['apple', 'orange', 'pineapple']
list2 = ['sugar', 'salt', 'pepper', 'vinegar']

This is what I tried:

# Delimiters: 'AND', 'and', 'OR', 'or', 'NOT', 'not'
# Characters to remove: '[', ']', '(', ')'

test = str1.replace('(', '').replace(')', '').split(' AND ')

It gets a bit tricky when I want to split the string on multiple delimiters. Is there an easier way?

5 answers:

Answer 0 (score: 1)

Python code

You can use re.split together with strip (assuming words may contain spaces):

>>> import re
>>> item_list = [('apple OR ora nge AND NOT pineapple'), ('(sugar and salt) or (pepper and vinegar)')]
>>> [[part.strip() for part in re.split(r'(?i)(?:\b(?:AND|OR|NOT)\b|[]\[()])', s) if part.strip()] for s in item_list]
[['apple', 'ora nge', 'pineapple'], ['sugar', 'salt', 'pepper', 'vinegar']]
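
The same pattern can be wrapped in a small helper so that it is compiled once and reused. This is only a minimal sketch: the names SPLIT_RE and split_terms are made up for illustration and do not come from the answer above.

```python
import re

# Hypothetical helper names; the pattern itself is the one from the answer.
# (?i) makes the keywords case-insensitive, \b stops "or" inside "orange"
# from matching, and []\[()] matches any single bracket or parenthesis.
SPLIT_RE = re.compile(r'(?i)(?:\b(?:AND|OR|NOT)\b|[]\[()])')

def split_terms(text):
    """Split text on AND/OR/NOT and brackets, dropping empty pieces."""
    return [part.strip() for part in SPLIT_RE.split(text) if part.strip()]

item_list = ['apple OR ora nge AND NOT pineapple',
             '(sugar and salt) or (pepper and vinegar)']
cleaned_list = [split_terms(s) for s in item_list]
print(cleaned_list)
```

Because the split happens on the keywords rather than on whitespace, multi-word terms such as 'ora nge' survive as a single item.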

Answer 1 (score: 1)

Here is another approach using a list comprehension:

item_list = [('apple OR orange AND NOT pineapple'), ('(sugar and salt) or (pepper and vinegar)')]
delimiters = ['OR', 'AND', 'NOT', 'and', 'or', 'not']
[[i.replace('(', '').replace(')', '') for i in x.split() if i not in delimiters] for x in item_list]

This gives the result:

[['apple', 'orange', 'pineapple'], ['sugar', 'salt', 'pepper', 'vinegar']]

Very easy to follow, IMO.

Answer 2 (score: 0)

str1 = 'apple OR orange AND NOT pineapple'
str2 = '(sugar and salt) or (pepper and vinegar)'

def spliter(line):
    dim = ['AND', 'and', 'OR', 'or', 'NOT', 'not']
    remove = ['[', ']', '(', ')']
    # Drop brackets and parentheses first
    for char in remove:
        line = line.replace(char, "")
    # Replace each space-delimited keyword with a single space
    for word in dim:
        line = line.replace(" " + word + " ", " ")
    # split() with no argument also collapses any repeated whitespace
    return line.split()

print(spliter(str1))
print(spliter(str2))

Output

$ python sample.py
['apple', 'orange', 'pineapple']
['sugar', 'salt', 'pepper', 'vinegar']
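
One caveat with padding each keyword with spaces: a keyword at the very start or end of the string has a space on only one side, so it is left in place (for example 'NOT apple' keeps its 'NOT'). A possible workaround, sketched here under the hypothetical name split_padded, is to pad the line itself before replacing.

```python
def split_padded(line):
    """Variant of the answer's function that also catches leading/trailing keywords."""
    keywords = ['AND', 'and', 'OR', 'or', 'NOT', 'not']
    # Drop brackets and parentheses, as in the answer above
    for char in '[]()':
        line = line.replace(char, '')
    # Pad the line so a keyword at either end is also surrounded by spaces
    line = ' ' + line + ' '
    for word in keywords:
        padded = ' ' + word + ' '
        # Repeat until no match is left: adjacent keywords share a space,
        # so a single replace() pass can miss the second one
        while padded in line:
            line = line.replace(padded, ' ')
    return line.split()

print(split_padded('NOT apple'))
print(split_padded('(sugar and salt) or (pepper and vinegar)'))
```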

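Answer 3 (score: 0)

Instead of splitting on the stop words, split on whitespace and filter out the stop words. You can use replace to strip punctuation from the individual words.

For example: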

item = 'apple OR orange AND NOT pineapple'  # sample input (str1 from the question)
stoplist = {'AND', 'OR', 'NOT'}
cleaned = [s.replace('(', '').replace(')', '')
           for s in item.split()
           if s.upper() not in stoplist]
print(cleaned)

Output:

['apple', 'orange', 'pineapple']

The s.upper() call is optional; it only makes the stop-word check case-insensitive.
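
To get the list of lists the question asks for, the same filter can be applied to every string. The sketch below is one way to do it, not the answer's exact code: the names STOPLIST and clean are hypothetical, and it swaps the chained replace calls for str.strip('()[]'), which only removes those characters from the ends of each token.

```python
STOPLIST = {'AND', 'OR', 'NOT'}

def clean(item):
    """Split on whitespace, trim brackets from each token, drop stop words."""
    # Trim first so that a token like '(NOT' is still recognized as a stop word
    words = (tok.strip('()[]') for tok in item.split())
    # 'w and ...' also drops tokens that were nothing but brackets
    return [w for w in words if w and w.upper() not in STOPLIST]

item_list = ['apple OR orange AND NOT pineapple',
             '(sugar and salt) or (pepper and vinegar)']
cleaned_list = [clean(item) for item in item_list]
print(cleaned_list)
```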

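Answer 4 (score: 0)

You can use translate to remove the unwanted characters and then split to get the words. Once you have the words, you can use a list comprehension to filter out the ones you don't want: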

>>> str1 = 'apple OR orange AND NOT pineapple'
>>> str2 = '(sugar and salt) or (pepper and vinegar)'
>>> words = {'and', 'or', 'not'}
>>> chars = '()[]'
>>> # Python 3 form; in Python 2 the call was str1.translate(None, chars)
>>> table = str.maketrans('', '', chars)
>>> [x for x in str1.translate(table).split() if x.lower() not in words]
['apple', 'orange', 'pineapple']
>>> [x for x in str2.translate(table).split() if x.lower() not in words]
['sugar', 'salt', 'pepper', 'vinegar']