Question

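I am trying to reduce the noise in a large dataset that contains grammatical keywords. Is there a way to trim the dataset row-wise based on a specific set of keywords?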

Input: 

id1, id2, keyword, freq, gp1, gps2 
222, 111, #paris, 100, loc1, loc2 
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234, mobile, 900, loc3, loc4

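From this, I need to remove common keywords such as have, she, and, did, which are not useful to me. I want to drop every row whose keyword is one of these words. The goal is to remove noise from the dataset for future analysis.

What would be a simple way to eliminate such rows using a chosen set of keywords?

Any suggestions are appreciated. Thanks in advance!!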

Answer 1

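Assuming you have a DataFrame df, use isin to find which rows do or do not contain a word from a list or set. Then use boolean indexing to filter the DataFrame.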

list_of_words = ['she', 'have', 'did', 'and']
df[~df.keyword.isin(list_of_words)]
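
As a minimal, self-contained sketch of the above, assuming the sample rows from the question are loaded from an inline CSV string (the names raw and filtered are illustrative and not part of the original answer):

```python
import io
import pandas as pd

# Sample rows from the question, embedded inline for illustration.
raw = """id1, id2, keyword, freq, gp1, gps2
222, 111, #paris, 100, loc1, loc2
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234, mobile, 900, loc3, loc4
"""

# skipinitialspace=True drops the blank after each comma, so the column
# is named "keyword" rather than " keyword".
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

list_of_words = ['she', 'have', 'did', 'and']

# isin builds a boolean mask; ~ inverts it so only non-matching rows remain.
filtered = df[~df.keyword.isin(list_of_words)]
print(filtered)
```

Note that the filtered frame keeps the original index labels; call reset_index(drop=True) if a contiguous index is needed.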

Answer 2

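I did something similar not long ago. I was pleasantly surprised by how well Pandas and NumPy play together, and by how fast things are when you stick to vectorized operations.

The following example does not need any files other than the source file. Modify the table according to your needs.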

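
A minimal sketch in that spirit, and not the answer's original code: it assumes a small stand-in table built inline, so no extra files are needed, and uses NumPy's np.isin for the vectorized membership test. The names is_noise and clean are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in table, built inline so no extra files are needed.
df = pd.DataFrame({
    'keyword': ['#paris', 'have', '#USA', 'she', 'mobile'],
    'freq': [100, 1000, 30, 600, 900],
})
stopwords = ['have', 'she', 'and', 'did']

# np.isin compares the whole column against the stopword list in one
# vectorized call and returns a boolean array (True where a stopword matches).
is_noise = np.isin(df['keyword'].to_numpy(), stopwords)

# Keep only the rows that are not flagged as noise.
clean = df[~is_noise]
print(clean)
```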

Answer 3

Given the data:

import pandas as pd

df = pd.DataFrame({
    'keyword': ['#paris', 'have', '#USA', 'she', 'mobile']
})
stopwords = set(['have', 'she', 'and', 'did'])

The following approach tests whether a stopword is part of a keyword:

df = df[df['keyword'].str.contains('|'.join(stopwords)) == False]

Output:

  keyword
0  #paris
2    #USA
4  mobile

The next approach tests whether a stopword matches a keyword exactly (1:1):

df = df.drop(df[df['keyword'].map(lambda word: word in stopwords)].index)

Output:

  keyword
0  #paris
2    #USA
4  mobile
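
The two approaches give different results once a stopword appears inside a longer keyword. A small sketch of that difference, assuming two hypothetical keywords ('shell' and 'android') that are not in the original data:

```python
import pandas as pd

df = pd.DataFrame({
    # 'shell' and 'android' are hypothetical additions for illustration.
    'keyword': ['#paris', 'have', 'shell', 'she', 'android']
})
stopwords = set(['have', 'she', 'and', 'did'])

# Substring test: drops any keyword that contains a stopword anywhere.
# sorted() only makes the joined pattern deterministic.
pattern = '|'.join(sorted(stopwords))
substring_kept = df[~df['keyword'].str.contains(pattern)]

# Exact (1:1) test: drops only keywords that are equal to a stopword.
exact_kept = df[~df['keyword'].isin(stopwords)]

print(substring_kept['keyword'].tolist())
print(exact_kept['keyword'].tolist())
```

Here the substring test also removes 'shell' and 'android', because they contain 'she' and 'and', while the exact test keeps them. For removing stopword rows, the exact match is usually the safer choice.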

Answer 4

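Given the new information about memory requirements, I am adding this as a new answer, because the old answer is still useful for small files. This one reads the input file line by line instead of loading the whole file into memory.

Save the program as filterbigcsv.py, then run it with python filterbigcsv.py big.csv clean.csv to read from big.csv and write to clean.csv. For a 1.6 GB test file, it took one minute on my system. Memory usage was 3 MB.

This script should handle any file size; you just have to wait longer for it to finish.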

import sys


input_filename = sys.argv[1]
output_filename = sys.argv[2]


blacklist = set("""
have she and did
""".strip().split())


blacklist_column_index = 2 # Third column, zero indexed


with open(input_filename, "r") as fin, \
     open(output_filename, "w") as fout:
    for line in fin:
        if line.split(",")[blacklist_column_index].strip(", ") in blacklist:
            pass # Don't pass through
        else:
            fout.write(line) # Print line as it was, with its original line ending
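
If you would rather stay within pandas, a similar bounded-memory effect can be sketched with the chunksize parameter of read_csv. This is an alternative under stated assumptions, not the script above: the function name filter_csv_in_chunks, its parameters, and the default chunk size are hypothetical, and the file is assumed to have a header row with a keyword column.

```python
import pandas as pd


def filter_csv_in_chunks(input_filename, output_filename, blacklist,
                         column='keyword', chunksize=100_000):
    """Copy a CSV file, skipping rows whose `column` value is blacklisted."""
    # With chunksize set, read_csv yields DataFrames of at most
    # `chunksize` rows instead of loading the whole file at once.
    reader = pd.read_csv(input_filename, skipinitialspace=True,
                         chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Strip stray whitespace before the exact-match test.
        is_noise = chunk[column].str.strip().isin(blacklist)
        kept = chunk[~is_noise]
        # Write the header only with the first chunk, then append.
        kept.to_csv(output_filename, mode='w' if i == 0 else 'a',
                    header=(i == 0), index=False)
```

Unlike the line-by-line script, this version re-serializes each row, so spacing after commas in the output may differ from the input.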

Cleaning noisy data by dropping rows with pandas

4 answers: