How to find the index of a substring and delete everything after it

Asked: 2019-02-12 15:10:34

Tags: python pandas numpy

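I am trying to clean CSV data for my project. The file contains news articles mixed with unwanted content such as JavaScript code. It is our project's dataset, and my job is to filter it and remove the unnecessary characters.

What I want to do is find the index of a substring in a row/column and, if it is present, delete everything after it (including the substring itself).

I have written code that finds the index and can replace the exact substring, but the problem is that I want to delete everything that follows it as well.

I tried using the pandas library to load the data and replace the matching rows. However, as the code shows, it only replaces the exact substring with an empty string. I want to find the index of the substring (say "window") and delete everything in the row from "window" onward.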

import pandas as pd
import numpy as np
import csv


pathtofile = "t1.csv"
data = pd.read_csv(pathtofile, encoding='utf-8', index_col=0)

print(type(data))   # prints the DataFrame type
print(data.head())  # prints the columns [id, contetn, date]

sub = 'window._ttzi'  # the substring I am searching for with find()
data["Indexes"] = data["contetn"].str.find(sub)
print(data)  # prints the CSV data with the additional index column

# this only targets the exact string; it does not cut what follows it
data = data.replace('window._ttzi', '')

#data.to_csv("t1edited.csv", encoding = 'utf-8')
print(data)

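2 answers:

Answer 0: (score: 0)

As you said in the comments, you want to remove the characters from every column, so you can "simply" loop over each column and collect everything that appears after the substring.

One way to do it, admittedly not an optimized one, could be: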

# Get a list of all of df's columns
# (df is the DataFrame loaded in the question, where it is called `data`)
columns = df.columns
# list that will collect the unwanted tails
strings = []

# here is your substring; if it is a list, you'll need to adjust the loop below
character = 'window._ttzi'

# loop through each column
for column in columns:
    try:
        # collect everything that comes AFTER the substring. I couldn't find a way
        # to keep the substring plus what comes before it in one step,
        # so that is handled in another loop later
        strings.append(df[column].str.split(character).str[1].values) # the 1 means after the substring
    except AttributeError:
        # column is not string / object, so ignore it
        pass
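
To see what the split step returns, here is a minimal sketch on made-up data. The sample strings are assumptions for illustration and are not taken from the question's CSV. The marker is passed through re.escape because pandas may treat a multi-character pattern as a regular expression.

```python
import re

import pandas as pd

# Hypothetical sample rows: one with the marker, one without
s = pd.Series(["headline text window._ttzi = {};", "clean headline"])

marker = "window._ttzi"
# Escape the marker so that "." is matched literally
parts = s.str.split(re.escape(marker), n=1)

before = parts.str[0]  # text in front of the marker
after = parts.str[1]   # text behind the marker, NaN when the marker is absent

print(before.tolist())
print(after.tolist())
```

Rows that do not contain the marker yield NaN at position 1, which is why NaN values get filtered out in the next step.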

Tidy up the list:

# flatten the array of arrays
flat_list = [item for sublist in strings for item in sublist]

# removing nan values
cleaned_list = [x for x in flat_list if str(x) != 'nan']

# Remove duplicates (set())
unique_list = list(set(cleaned_list))

Finally, replace the original columns with the new values; in other words, this removes the unnecessary data:

# now that we have collected everything we don't want, loop once more,
# but this time keep everything before each collected string.
# instead of split() you could also use .replace(string, '') here
for column in columns:
    for string in unique_list:
        try:
            df[column] = df[column].str.split(string).str[0] # the zero means before the string
        except AttributeError:
            # column is not string / object, so ignore it
            pass
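
Note that the loop above splits on each collected tail, so the marker itself (window._ttzi) appears to stay in the cell. If the goal is to drop the marker as well, a shorter variant is to split once on the marker and keep the part in front of it. The sketch below is only one way to do that; truncate_at and the sample DataFrame are hypothetical names and data, not part of the original answer.

```python
import re

import pandas as pd


def truncate_at(df, marker):
    """Return a copy of df with every string cell cut at the first marker."""
    out = df.copy()
    pattern = re.escape(marker)  # match the marker literally
    for col in out.columns:
        try:
            # keep only the text before the first occurrence of the marker
            out[col] = out[col].str.split(pattern, n=1).str[0]
        except AttributeError:
            # column is not string / object, so leave it untouched
            pass
    return out


# Hypothetical data shaped like the question's columns
df_demo = pd.DataFrame({
    "contetn": ["news body window._ttzi = 1;", "plain news"],
    "date": ["2019-02-12", "2019-02-11 window._ttzi x"],
    "views": [3, 5],
})

cleaned = truncate_at(df_demo, "window._ttzi")
print(cleaned)
```

Cells without the marker are returned unchanged, and non-string columns such as views are skipped by the except branch.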

Answer 1: (score: 0)

I searched the internet a lot and then found the answer myself.

The pandas rstrip() function did what I needed.

First, we open the file with pathtofile = "t1.csv" and data = pd.read_csv(pathtofile, encoding='utf-8', index_col=0). Then we take the data column by column and strip it using a specific substring such as sub = 'window._ttzi'. So the code looks like data['contetn'].str.rstrip(sub)
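
One caveat worth checking on your own data: str.rstrip treats its argument as a set of characters to remove from the right end of each string, not as a substring to cut at. The sketch below compares it with Series.str.partition, which splits on the literal substring. The sample strings are made up for illustration.

```python
import pandas as pd

sub = "window._ttzi"

# Hypothetical rows: one containing the marker, one clean row
s = pd.Series(["news body window._ttzi = {a: 1};", "snow in town"])

# rstrip removes trailing characters that appear anywhere in `sub`
stripped = s.str.rstrip(sub)

# partition splits at the first literal occurrence of `sub`;
# column 0 holds the text in front of it
cut = s.str.partition(sub)[0]

print(stripped.tolist())
print(cut.tolist())
```

In this sketch rstrip leaves the first row unchanged, because the row ends with ";", and it shortens the clean second row, because "town" consists only of characters found in sub. The partition version cuts the first row at the marker and leaves the second row alone.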

I will keep looking for other ways to remove the unnecessary data. Have a nice day.