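I'm trying to clean CSV data for my project. It contains news articles mixed with unwanted content (for example, JavaScript code). This is our project's dataset, and my job is to filter it and remove the unnecessary characters.
What I want to do is find the index of a given substring in a row/column and, if it is present, delete everything from that point on (including the substring itself).
I've already written code that finds the index and can replace the exact substring, but the problem is that I also want to delete everything that comes after it.
I tried using the pandas library to load the data and replace the matching rows. However, as the code shows, it only replaces the exact substring with an empty string. I want to find the index of the substring (say, "window") and delete everything in the row from "window" onward.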
import pandas as pd
import numpy as np
import csv
pathtofile = "t1.csv"
data = pd.read_csv(pathtofile, encoding='utf-8', index_col=0)
print(type(data))  # the output shows this is a DataFrame
print(data.head())  # prints out [id, contetn, date]
sub = 'window._ttzi'  # the substring I'm searching for with find()
data["Indexes"] = data["contetn"].str.find(sub)
print(data)  # prints the CSV data with the additional index column
data = data.replace('window._ttzi', '')  # only replaces the exact string, nothing after it
# data.to_csv("t1edited.csv", encoding='utf-8')
print(data)
Answer 0: (score: 0)
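As you said in the comments, you want to remove the characters from all columns, so you can "simply" loop over every column and collect everything that appears after the marker string.
One way to do it, admittedly not an optimized one, could be: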
# Get a list of all of df's columns (df is the DataFrame loaded from the CSV)
columns = df.columns
# accumulator for the text found after the marker
strings = []
# here is your marker string; if it is a list, you'll need to adjust the loop below
character = 'window._ttzi'
# looping through each column
for column in columns:
    try:
        # appends everything that comes AFTER the marker. Couldn't find a way to keep the marker + what's before,
        # so will fix it through another loop later
        strings.append(df[column].str.split(character).str[1].values)  # the 1 means after the marker
    except AttributeError:
        # column is not string / object so ignore it
        pass
Tidy up the list:
# flatten the array of arrays
flat_list = [item for sublist in strings for item in sublist]
# removing nan values
cleaned_list = [x for x in flat_list if str(x) != 'nan']
# Remove duplicates (set())
unique_list = list(set(cleaned_list))
Finally, replace the original columns with the new values; in other words, this removes the unnecessary data:
# since we got everything we don't want, go through a loop once again, but
# this time keep everything before the string.
# instead of split() you could also use .replace(string, '') here
for column in columns:
    for string in unique_list:
        try:
            df[column] = df[column].str.split(string).str[0]  # the zero means before the string
        except AttributeError:
            # column is not string / object
            pass
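For comparison, here is a shorter sketch of the same clean-up. It is not the code above: it assumes pandas' `Series.str.partition`, which splits each value at the first literal occurrence of the marker, and it keeps only the part before the marker, so the marker itself is dropped as well. The helper name `truncate_at_marker` and the sample rows are made up for illustration; only the column name `contetn` and the marker `window._ttzi` come from the question.

```python
import pandas as pd


def truncate_at_marker(df, marker):
    """Return a copy of df with every text cell cut at the first marker.

    Hypothetical helper for illustration, not part of pandas.
    """
    result = df.copy()
    for column in result.columns:
        try:
            # partition() yields three columns: before, marker, after.
            # Column 0 is the text before the marker, or the whole text
            # when the marker does not occur.
            before = result[column].str.partition(marker)[0]
        except AttributeError:
            # column is not string / object, so leave it untouched
            continue
        result[column] = before
    return result


# Made-up sample data standing in for t1.csv
df = pd.DataFrame(
    {
        "contetn": ["Some news text window._ttzi = {x: 1};", "Clean article"],
        "views": [10, 20],
    }
)
cleaned = truncate_at_marker(df, "window._ttzi")
print(cleaned["contetn"].tolist())
```

Unlike the two-loop version, this needs only one pass and never builds the list of unwanted strings. If you want to keep the marker and drop only what follows it, you would have to append the marker back yourself.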
Answer 1: (score: 0)
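I searched the internet a lot and then found an answer myself.
The pandas str.rstrip() function did what I needed.
First, we open the file with pathtofile = "t1.csv" and data = pd.read_csv(pathtofile, encoding='utf-8', index_col=0), then take the data column by column and strip it using a specific string such as sub = 'window._ttzi'. So the code looks like data['contetn'].str.rstrip(sub), with the result assigned back to the column. Keep in mind that rstrip only trims trailing characters that appear in sub.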
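To show what `rstrip` does with a multi-character argument, here is a small plain-Python check; as far as I know, `Series.str.rstrip` applies the same rule to each element. The sample strings are made up. `str.rstrip(chars)` treats its argument as a set of characters to trim from the right end, so it leaves the marker in place whenever other text follows it, whereas `str.partition` cuts at the marker.

```python
sub = "window._ttzi"
# Made-up sample row: news text followed by leftover JavaScript
text = "Some news text window._ttzi = {x: 1};"

# rstrip() removes trailing characters that appear in sub.
# The last character here is ";", which is not in sub, so nothing is removed.
stripped = text.rstrip(sub)

# partition() splits at the first occurrence of sub; element 0 is the text before it.
truncated = text.partition(sub)[0]

# When the marker is the very end of the string, rstrip() does remove it,
# because every trailing character belongs to the set.
at_end = "Some news text window._ttzi".rstrip(sub)

print(repr(stripped))
print(repr(truncated))
print(repr(at_end))
```

So `rstrip` is only a fit when the marker sits at the very end of the cell; for cutting off the marker together with whatever follows it, something like `partition` is closer to what the question asks for.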
I'll keep looking for other ways to remove the unnecessary data. Have a nice day.