Question

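I need to process data measured every 20 seconds throughout the whole of 2018. The raw file has the following structure:

date time, a lot of trash

several rows

trash again, number of samples

data

date time, a lot of trash

etc.

I want to make one pandas DataFrame out of it, or at least one DataFrame per block of data (the block size is encoded as the number of samples), while keeping the time of measurement.

How can I ignore all the other trash? I know it is written periodically (period = number of samples), but:
- I don't know how many strings are in the file
- I don't want to call file.getline() explicitly in a loop, because it would run forever (especially in Python) and I don't have enough computing power for that

Is there any way to skip rows periodically in pandas or another lib? Or how else can I solve this?

Here is an example of my data:

https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing

I want to get a DataFrame similar to the data table in the pic, plus an additional column with the datetime, and without the technical rows.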

Answer 1

Use itertools.islice, where N below means "read every N-th line":

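from itertools import islice

import pandas as pd

N = 3                   # keep every N-th line
sep = ','               # field separator
file_path = 'data.txt'  # placeholder: path to your file

with open(file_path, 'r') as f:
    # islice(f, None, None, N) lazily yields lines 0, N, 2N, ...
    lines_gen = islice(f, None, None, N)
    df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])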
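If the wanted rows do not begin on the first line, islice also takes a start offset. The following is only a sketch: it assumes, hypothetically, that the data sits on every 4th line beginning at line index 3, and both the every_nth_line helper name and the sample text are made up for illustration.

```python
import io
from itertools import islice


def every_nth_line(lines, step, start=0):
    """Return an iterator over every `step`-th item, beginning at index `start`."""
    return islice(lines, start, None, step)


# Hypothetical sample: three junk lines followed by one data line, repeated.
sample = io.StringIO("junk\njunk\njunk\n1,2\njunk\njunk\njunk\n3,4\n")
rows = [line.strip().split(",") for line in every_nth_line(sample, step=4, start=3)]
print(rows)  # [['1', '2'], ['3', '4']]
```

The resulting list of rows can be handed to pd.DataFrame in the same way as in the snippet above.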

Answer 2

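I repeated your data three times. It sounds like you need every 4th row (not starting at 0), because that is where your data is. The documentation for skiprows says: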

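If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].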

So what if we put a not in test inside the lambda function? That is what I do below: I create a list of the row indexes I want to keep, and pass a lambda that uses not in to the skiprows parameter. In plain English: skip every row that is not a 4th row.

import pandas as pd

# creating a list of all the 4th row indexes. If you need more than 1 million, just up the range number
list_of_rows_to_keep = list(range(0, 1000000))[3::4]

# passing this list to the lambda function using not in.
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in list_of_rows_to_keep)
df.head()

# output
0 data
1 data
2 data
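The same selection can also be expressed without building a large list, by testing the row index with the modulo operator. This is a sketch under the assumption that the data occupies exactly every 4th row starting at index 3; make_skip is a hypothetical helper name, not part of pandas.

```python
def make_skip(period, offset):
    """Build a predicate usable as read_csv's skiprows: True means 'skip this row'."""
    def skip(row_index):
        # Keep only the rows whose index is `offset` modulo `period`.
        return row_index % period != offset
    return skip


skip = make_skip(period=4, offset=3)
kept = [i for i in range(12) if not skip(i)]
print(kept)  # [3, 7, 11]
```

It could then be passed as pd.read_csv(path, skiprows=make_skip(4, 3), header=None), where header=None assumes the kept rows contain no header line. Compared with a not in lookup against a list, this has no upper limit on the row count and does not scan a list for every row.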

Answer 3

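Just count how many rows are in the file and put a list of the ones to drop (maybe called useless_rows) into the rows pandas should skip: pandas.read_csv(..., skiprows=useless_rows).

My problem was counting the rows cheaply. There are a few ways to do it:

- On Linux, the command "wc -l" (here is an explanation of how to put it into your code: Running "wc -l <filename>" within Python Code)
- A generator. I have a key in the relevant rows: it is in the last column. It is not very informative, but it was a rescue for me. So I can use it to count the strings; it looks like 500000 rows take 0.00011 to count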
```
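# Generator function (the name is illustrative); yield needs an enclosing def.
def filtered_rows(filename):
    with open(filename) as f:
        for row in f:
            # Rows containing this key are skipped.
            if '2147483647' in row:
                continue
            yield row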
```
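The count-then-skip idea described above could be sketched roughly as follows. The helper names (count_lines, useless_rows) and the period and offset values are assumptions for illustration only; the real layout of the file decides them.

```python
def count_lines(path):
    """Count lines by streaming the file once, without loading it into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def useless_rows(n_lines, period, offset):
    """Return the indices of all rows except those equal to `offset` modulo `period`."""
    return [i for i in range(n_lines) if i % period != offset]


# Example with an assumed layout: 8 lines, data on indices 3 and 7.
print(useless_rows(8, period=4, offset=3))  # [0, 1, 2, 4, 5, 6]
```

The resulting list would then go to pandas.read_csv(..., skiprows=useless_rows(count_lines(path), 4, 3)).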

How do I periodically skip rows when reading a txt file with pandas?
