Question

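I'm trying to clean CSV data for my project. It contains news articles along with unwanted content (for example, JavaScript code). This is our project's dataset, and my job is to filter it and remove the unnecessary characters.

What I want to do is find the index of a character (substring) within a row/column and, if it is present, delete everything after it (including the character itself).

I have written code that checks the index and can replace the exact characters, but the problem is that I want to delete everything after them as well.

I tried using the Pandas library to load the data and replace the exact rows. However, as you can see from the code, it only replaces the exact string with an empty one. I want to find the index of the string (say, "window") and delete everything that follows "window" in the row.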

import pandas as pd
import numpy as np
import csv


pathtofile = "t1.csv"
data = pd.read_csv(pathtofile, encoding='utf-8' ,index_col=0)

print(type(data))  # shows that data is a DataFrame
print(data.head())  # prints the columns [id, contetn, date]

sub = 'window._ttzi'  # the substring I'm searching for with find()
data["Indexes"] = data["contetn"].str.find(sub)
print(data)  # prints the CSV data with the additional "Indexes" column

data = data.replace('window._ttzi', '')

#data.to_csv("t1edited.csv", encoding = 'utf-8')
print(data)
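
To make the goal concrete, here is a minimal plain-Python sketch of the behaviour being asked for on a single string. The helper name `truncate_at` and the sample text are made up for illustration; they are not part of the original code.

```python
def truncate_at(text, marker):
    """Return text up to, but not including, the first occurrence of marker."""
    idx = text.find(marker)  # find() returns -1 when the marker is absent
    return text if idx == -1 else text[:idx]


# Hypothetical cell content: news text followed by leftover script code
sample = "Breaking news story. window._ttzi = {a: 1}; more script"

kept = truncate_at(sample, "window._ttzi")
untouched = truncate_at("No script here", "window._ttzi")
print(repr(kept))
print(repr(untouched))
```

The marker and everything after it are dropped, while strings without the marker come back unchanged.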

Answer 1

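As you said in the comments, you want to remove the characters from all columns, so you can "simply" loop over each column and collect everything that appears after the character.

So one way to do it, admittedly not an optimized one, could be: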

# df is the DataFrame loaded from the CSV (called `data` in the question)
# Get a list of all of df's columns
columns = df.columns
# list that will collect the unwanted trailing text
strings = []

# here is your character; if it is a list, you'll need to adjust the loop below
character = 'window._ttzi'

# loop through each column
for column in columns:
    try:
        # append everything that comes AFTER the character. Couldn't find a way to keep
        # the character plus what's before it, so a second loop fixes that later
        strings.append(df[column].str.split(character).str[1].values) # the 1 means after the character
    except AttributeError:
        # column is not string / object so ignore it
        pass

Tidy up the list:

# flatten the array of arrays
flat_list = [item for sublist in strings for item in sublist]

# removing nan values
cleaned_list = [x for x in flat_list if str(x) != 'nan']

# Remove duplicates (set())
unique_list = list(set(cleaned_list))

Finally, replace the original columns with the new values; in other words, this removes the unwanted data:

# now that we have everything we don't want, loop through the columns once more, but
# this time keep everything before the string.
# instead of split() you could also use .str.replace(string, '') here
for column in columns:
    for string in unique_list:
        try:
            df[column] = df[column].str.split(string).str[0] # the zero means before the character
        except AttributeError:
            # column is not string / object
            pass
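
The two loops above can probably be collapsed into one pass. The following is only a sketch under assumptions: the function name `truncate_columns` and the sample frame are hypothetical, and it swaps in `Series.str.partition`, which splits on the literal marker, in place of `str.split`. Unlike the loops above, it also drops the marker itself, which is what the question asks for.

```python
import pandas as pd


def truncate_columns(df, marker):
    """Cut every string cell at the first occurrence of marker (marker removed too)."""
    out = df.copy()
    for column in out.columns:
        try:
            # partition() yields three columns: before, marker, after.
            # Column 0 is the text before the first occurrence of the marker.
            out[column] = out[column].str.partition(marker)[0]
        except AttributeError:
            # non-string column: the .str accessor is unavailable, leave it as-is
            pass
    return out


# Hypothetical sample data standing in for the CSV from the question
df = pd.DataFrame({
    "contetn": ["Some news text window._ttzi = 1;", "Clean article"],
    "views": [10, 20],
})
cleaned = truncate_columns(df, "window._ttzi")
print(cleaned)
```

Cells that do not contain the marker are returned unchanged, and non-string columns are skipped in the same way as in the loops above.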

Answer 2

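I searched the internet a lot and then found the answer myself.

The pandas rstrip() function solved what I needed.

First, we open the file with pathtofile = "t1.csv" data = pd.read_csv(pathtofile, encoding='utf-8' ,index_col=0), then split the data into columns, and then strip using a specific string such as sub = 'window._ttzi'. So the code will look something like data['contetn'].str.rstrip(sub).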
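
One thing worth checking before relying on this: Python's `str.rstrip`, which `Series.str.rstrip` mirrors, treats its argument as a set of characters to strip from the right end, not as a substring to search for. The short plain-Python sketch below, with made-up sample strings, shows what that means in practice.

```python
sub = "window._ttzi"

# The marker is followed by other text: the last character ';' is not in the
# character set of sub, so nothing is stripped at all.
with_script = "story window._ttzi = 1;".rstrip(sub)

# No marker present, but the trailing letters d, i, t, o, n all occur in sub,
# so they are stripped anyway.
plain_text = "Late edition".rstrip(sub)

print(repr(with_script))  # 'story window._ttzi = 1;'
print(repr(plain_text))   # 'Late e'
```

So `rstrip(sub)` only appears to work when the marker happens to sit at the very end of the cell, and it can also eat trailing letters of ordinary text.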

I will keep looking for other ways to remove the unnecessary data. Have a nice day.

How to find a character's index and remove the characters after it
