String processing, data wrangling, regular expressions

Time: 2019-07-10 01:54:20

Tags: python regex

I have a .txt file with 3 million lines. The file contains data like the following:

# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com

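In the data above, each code is associated with all of the domains listed beneath it. I want to convert this data into a format that can be loaded into HiveQL / SQL. The HiveQL table should look like this: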

+--------------------+--------------+-------------+-----------------------------------------------------+
|    domain_name     | period_count | parsed_code |                      raw_code                       |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test               |            0 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com       |            2 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com       |            2 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn       |            2 | 127.0.1.2   |  :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml           |            2 | 127.0.1.4   |  :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com  |            2 | 127.0.1.4   |  :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com  |            2 | 127.0.1.4   |  :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+

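Note that I don't need any of the pipe characters in the output. They are only there to make the example above look like a table.

I assume that creating a HiveQL table like the one above will involve converting the .txt to a .csv or a Pandas DataFrame. If a .csv is created, it might look like this: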

domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$

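I am interested in a Python solution, but I am not familiar with the packages and functions needed for the data-wrangling steps above. I am looking for either a complete solution or code snippets I can use to build my own. I assume regular expressions will be needed to identify the "category" or "code" lines in the raw data; they always start with ":127.0.1". I would also like to parse the code into a parsed_code column and create a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt from the sample data provided at the beginning of this post.

2 answers:

Answer 0: (score: 1)

Regardless of how you want to format the result in the end, I think the first step is to separate the domain_name from the code. That part is pure Python.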

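rows = []
code = None          # most recent raw code line
parsed_code = None   # e.g. '127.0.1.2'
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            # A code line applies to every domain until the next code line.
            code = line
            parsed_code = line.split(':')[1]
            continue
        if not line or line.startswith('#'):
            # Skip blank lines and comment/header lines.
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))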
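
The question guessed that a regular expression would be needed to spot the code lines. `startswith` is enough for the sample, but if stricter matching is preferred, a minimal sketch could look like the one below. The names `CODE_LINE` and `parse_lines` are hypothetical, and the pattern assumes every code line has the form `:127.0.1.N:` followed by a URL, as in the sample data. Unlike the loop above, it strips all surrounding whitespace from each line.

```python
import re

# Assumption: code lines look like ":127.0.1.<digits>:<url>", as in the sample.
CODE_LINE = re.compile(r'^:(127\.0\.1\.\d+):')


def parse_lines(lines):
    """Yield (domain_name, period_count, parsed_code, raw_code) tuples."""
    raw_code = parsed_code = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment/header lines
        match = CODE_LINE.match(line)
        if match:
            # Remember the code; it applies to the domains that follow.
            raw_code, parsed_code = line, match.group(1)
            continue
        yield (line, line.count('.'), parsed_code, raw_code)


# A few lines taken from the sample data in the question.
sample = [
    '# RSYNC: 0 1 1 0 512 0',
    ':127.0.1.2:https://www.spamhaus.org/query/domain/$',
    'test',
    '.0-0m5tk.com',
    ':127.0.1.4:https://www.spamhaus.org/query/domain/$',
    '.0-il.ml',
]
rows = list(parse_lines(sample))
```

Because `parse_lines` is a generator, it can also be fed an open file object directly, so the 3 million lines do not all have to be held in memory at once.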

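Purely for illustration, you can use pandas to format the data nicely as a table, which may help if you want to pipe it into SQL, but it is not strictly necessary. Post-processing of strings is also very straightforward in pandas.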

import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print(df)

This prints:

         domain_name  period_count parsed_code                                           raw_code
0               test             0   127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
1       .0-0m5tk.com             2   127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
2       .0-1-hub.com             2   127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
3       .zzzy1129.cn             2   127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
4           .0-il.ml             2   127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
5  .005verf-desj.com             2   127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
6  .01accesfunds.com             2   127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...

Answer 1: (score: 0)

You can do all of this using the Python standard library.
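
As an illustration of a standard-library-only approach, here is a minimal sketch that streams the input and writes the CSV layout requested in the question using the `csv` module. It is not the answerer's original code; the function name `convert` and the in-memory demo data are assumptions made for the example.

```python
import csv
import io

HEADER = ['domain_name', 'period_count', 'parsed_code', 'raw_code']


def convert(src, dst):
    """Stream lines from src, write CSV rows to dst, return the row count."""
    writer = csv.writer(dst, lineterminator='\n')
    writer.writerow(HEADER)
    raw_code = parsed_code = None
    count = 0
    for line in src:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment/header lines
        if line.startswith(':127.0.1'):
            # Code line: remember it for the domains that follow.
            raw_code = line
            parsed_code = line.split(':')[1]
            continue
        writer.writerow([line, line.count('.'), parsed_code, raw_code])
        count += 1
    return count


# Small in-memory demo using lines from the question's sample data.
demo_in = io.StringIO(
    '# RSYNC: 0 1 1 0 512 0\n'
    ':127.0.1.2:https://www.spamhaus.org/query/domain/$\n'
    'test\n'
    '.0-0m5tk.com\n'
)
demo_out = io.StringIO()
n_rows = convert(demo_in, demo_out)
print(demo_out.getvalue())
```

For a real file, the same function could be called as `convert(src, dst)` inside `with open('input.txt') as src, open('output.csv', 'w', newline='') as dst:`. Rows are written one at a time, so memory use stays flat regardless of file size.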
