Split a column of strings and count the number of words with pandas

Time: 2017-12-25 17:04:37

Tags: python string pandas dataframe

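Hi, I have the table below. I want to split the strings in the string column on ';' and store the result in a new column. The table looks like this: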

id   string   
0    31672;0           
1    31965;0
2    0;78464
3      51462
4    31931;0

It would be great if someone knows how to do this in Python.

1 answer:

Answer 0: (score: 2)

Option 1
The basic solution, using str.split + str.len:

df['word_count'] = df['string'].str.split(';').str.len()
df

     string  word_count
id                     
0   31672;0           2
1   31965;0           2
2   0;78464           2
3     51462           1
4   31931;0           2
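
For anyone who wants to run Option 1 end to end, here is a minimal sketch. The construction of df is an assumption, since the question only shows the printed table and not the code that built it.

```python
import pandas as pd

# Assumed reconstruction of the frame shown in the question.
df = pd.DataFrame(
    {'string': ['31672;0', '31965;0', '0;78464', '51462', '31931;0']},
    index=pd.Index(range(5), name='id'),
)

# Split each string on ';' into a list, then take the length of each list.
df['word_count'] = df['string'].str.split(';').str.len()
print(df)
```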

Option 2
The clever (efficient, less memory-hungry) solution, using str.count:

df['word_count'] = df['string'].str.count(';') + 1
df

     string  word_count
id                     
0   31672;0           2
1   31965;0           2
2   0;78464           2
3     51462           1
4   31931;0           2

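Caveat: this attributes a word count of 1 even to an empty string. Note that str.split(';') turns an empty string into [''], so Option 1 also reports 1 in that case; if empty strings should count as 0, handle them explicitly.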
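
As a sketch of how the empty-string case behaves and one possible way to handle it: the sample Series s and the decision to map empty strings to a count of 0 are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data, including one empty string.
s = pd.Series(['31672;0', '51462', ''])

# With an explicit ';' separator, '' splits into [''], so both
# approaches count one "word" for the empty string.
by_split = s.str.split(';').str.len()
by_count = s.str.count(';') + 1

# Keep the count where the string is non-empty, otherwise use 0.
fixed = by_count.where(s != '', 0)
print(fixed)
```

Missing values (NaN) are a separate case: both expressions propagate them rather than returning a count.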

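If you want each word to occupy its own column, you can quickly and simply load the split pieces into a new DataFrame using tolist, then join the new DataFrame to the original with concat: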

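# Build one column per split piece, reusing df's index so the rows line up
v = pd.DataFrame(df['string'].str.split(';').tolist(), index=df.index)\
        .rename(columns=lambda x: x + 1)\
        .add_prefix('string_')

# axis must be passed by keyword in recent pandas versions
pd.concat([df, v], axis=1)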

     string  word_count string_1 string_2
id                                       
0   31672;0           2    31672        0
1   31965;0           2    31965        0
2   0;78464           2        0    78464
3     51462           1    51462     None
4   31931;0           2    31931        0
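
A possible alternative, sketched here as an assumption rather than as part of the original answer, is to let str.split build the columns directly with expand=True. The frame construction and the names parts and result are illustrative.

```python
import pandas as pd

# Assumed reconstruction of the frame shown in the question.
df = pd.DataFrame(
    {'string': ['31672;0', '31965;0', '0;78464', '51462', '31931;0']},
    index=pd.Index(range(5), name='id'),
)

# expand=True returns a DataFrame with one column per piece,
# already aligned to df's index.
parts = df['string'].str.split(';', expand=True)

# Rename the integer columns 0, 1, ... to string_1, string_2, ...
parts.columns = [f'string_{i + 1}' for i in range(parts.shape[1])]

result = pd.concat([df, parts], axis=1)
print(result)
```

Rows with fewer pieces than the widest row get a missing value in the extra columns, as in the output above.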