Cleaning up a .csv file after scraping

Time: 2020-10-27 13:51:34

Tags: python pandas

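A few weeks ago I scraped a website by hand to test some matplotlib plots, and I have since upgraded to scraping the site with Beautiful Soup. I am trying to clean the data the same way as in the old Stack Overflow post "There are two seperate types of data in a txt file, how do i use pandas to interate each line and add the corresponding data".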

The data records the times people go to the gym, and it is laid out like this:

"19:00AM on 10/27/2020"
"First Name Last Name"
"First Name Last Name"
"First Name Last Name"

To simplify the data, I want to move each of those dates into a second column next to the corresponding names:

"First Name Last Name", "19:00AM on 10/27/2020"
"First Name Last Name", "19:00AM on 10/27/2020"
"First Name Last Name", "19:00AM on 10/27/2020"

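The reshaping shown above can be sketched as a single pass that remembers the most recent time line and pairs it with every name line that follows. This is only a rough sketch: the function name `pair_names_with_time` is hypothetical, and it assumes a time line is any line that ends with " on " followed by an MM/DD/YYYY date.

```python
import re

# Assumption: a time line ends with ' on MM/DD/YYYY' (optionally followed by
# a closing quote); every other non-blank line is treated as a name.
DATE_LINE_RE = re.compile(r' on \d{1,2}/\d{1,2}/\d{4}"?$')

def pair_names_with_time(lines):
    """Return (name, time) tuples, carrying the last seen time line forward."""
    pairs = []
    current_time = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if DATE_LINE_RE.search(line):
            # A time line: remember it for the names that follow.
            current_time = line.strip('"')
        else:
            # A name line: pair it with the most recent time line.
            pairs.append((line.strip('"'), current_time))
    return pairs

# Sample lines copied from the layout shown above.
sample = [
    '"19:00AM on 10/27/2020"',
    '"First Name Last Name"',
    '"First Name Last Name"',
]
for name, when in pair_names_with_time(sample):
    print(f'"{name}", "{when}"')
```

Names that appear before any time line are paired with `None`, which makes such rows easy to spot afterwards.
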
Here is the code that worked last time:

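import re

# Matches a 12-hour clock time such as "9:00AM" at the start of a string.
TIME_RE = re.compile(r'\b((1[0-2]|0?[1-9]):([0-5][0-9])([AaPp][Mm]))')

def is_time_format(s):
    return bool(TIME_RE.match(s))

with open("1-weak-gym.csv") as fp:
    new_lines = []
    extra_info = ''
    for line in fp:
        last_bit = line.split(' ')[-1]
        if is_time_format(last_bit):
            # A time line: remember it for the name lines that follow.
            extra_info = line
        else:
            # A name line: append the most recent time line to it.
            new_lines.append(line.rstrip() + '\t' + extra_info)

# Use a context manager so the output file is flushed and closed.
with open("newOutput", 'w') as out:
    out.writelines(new_lines)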

Here is part of the csv file I want to clean up:

"Monday, October 26, 2020",8:00AM Until 8:50AM (50Minutes),"Name Joined Waiver
MA FName LName 10/25/2020 09:40 PM None
JB FName LName 10/26/2020 07:19 AM None
TB FName LName 10/25/2020 09:03 PM None
MB FName LName 10/25/2020 09:40 PM None
NC FName LName 10/25/2020 10:17 PM None
AC FName LName 10/25/2020 09:23 PM None
NF FName LName 10/26/2020 07:56 AM None
BG FName LName 10/25/2020 10:41 PM None
GH FName LName 10/26/2020 07:39 AM None
EH FName LName 10/25/2020 10:06 PM None
DM FName LName 10/25/2020 11:42 PM None
JM FName LName 10/25/2020 09:24 PM None
TP FName LName 10/26/2020 12:32 AM None
DS FName LName 10/25/2020 11:12 PM None
KS FName LName 10/25/2020 07:46 PM None
JW FName LName 10/25/2020 11:06 AM None"
"Monday, October 26, 2020",9:00AM Until 9:50AM (50Minutes),"Name Joined Waiver
DA FName LName 09/30/2020 07:44 AM 9/23/2020 6:06:38 PM
HB FName LName 09/30/2020 07:44 AM Manually Signed
LB FName LName 10/25/2020 08:43 PM None
VB FName LName 10/26/2020 09:25 AM None
KC FName LName 10/25/2020 07:39 PM None
DC FName LName 09/30/2020 07:44 AM 9/15/2020 8:12:32 PM
CD FName LName 09/30/2020 07:45 AM 2/2/2019 6:50:10 PM
JD FName LName 09/30/2020 07:45 AM 9/24/2020 5:51:14 PM
FL FName LName 09/30/2020 07:45 AM 8/24/2020 3:23:29 PM
MM FName LName 09/30/2020 07:44 AM 9/1/2020 2:34:04 PM
CP FName LName 09/30/2020 07:45 AM Manually Signed
KR FName LName 09/30/2020 07:45 AM 2/4/2020 4:25:40 PM
JS FName LName 09/30/2020 07:46 AM Manually Signed
TS FName LName 09/30/2020 07:45 AM 8/20/2020 9:22:49 AM
MS FName LName 09/30/2020 07:45 AM 8/19/2020 8:47:16 AM
TT FName LName 08/21/2020 09:21 PM Manually Signed
VW FName LName 10/26/2020 08:53 AM None
NW FName LName 9/30/2020 TBA Manually Signed"
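
One thing worth noting about this layout is that the third field of each session is a quoted string containing line breaks, so a CSV parser should read a whole session as one record rather than one record per line. A minimal sketch with the standard library `csv` module, using a shortened copy of the sample (the variable names are only illustrative):

```python
import csv
import io

# A shortened copy of the sample above. A real file would be opened with
# open(path, newline="") so that csv can handle the embedded line breaks.
sample = (
    '"Monday, October 26, 2020",8:00AM Until 8:50AM (50Minutes),"Name Joined Waiver\n'
    'MA FName LName 10/25/2020 09:40 PM None\n'
    'JB FName LName 10/26/2020 07:19 AM None"\n'
)

records = list(csv.reader(io.StringIO(sample)))

# Each session is a single record with three fields: date, slot, member list.
date, slot, body = records[0]
print(len(records))
print(date)
print(slot)
print(body.splitlines())
```

The member lines, including the embedded "Name Joined Waiver" header, all end up inside `body` and still need to be split apart.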

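The dates after each name may be confusing, but those are when the person signed up to use the gym, not the actual gym time slot. The columns are [initials, name, sign-up date, waiver signed]. What I want now is to append the "Monday, October 26, 2020",9:00AM Until 9:50AM (50Minutes),"Name Joined Waiver header to every row until the next such header occurs. Once the data is tidied up, I can go into Excel and delete the "Name Joined Waiver" text.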
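
One possible way to do this, sketched with the standard library only: read each session as one CSV record, split its member list into lines, and tag every member with that session's date and time slot. The function name `flatten_sessions`, the output column names, and the regular expression are assumptions made for illustration; in particular the pattern assumes the sign-up field is an MM/DD/YYYY date followed by either a 12-hour time or the word TBA, as in the sample.

```python
import csv
import io
import re

# Assumption: each member line is "<initials> <name> <joined> <waiver>", where
# <joined> is an MM/DD/YYYY date followed by a 12-hour time or the word TBA.
MEMBER_RE = re.compile(
    r'^(?P<initials>\S+) (?P<name>.+?) '
    r'(?P<joined>\d{1,2}/\d{1,2}/\d{4} (?:\d{1,2}:\d{2} [AP]M|TBA)) '
    r'(?P<waiver>.+)$'
)
HEADER = "Name Joined Waiver"
FIELDS = ["date", "slot", "initials", "name", "joined", "waiver"]

def flatten_sessions(fp):
    """Yield one dict per member, tagged with the session date and time slot."""
    for record in csv.reader(fp):
        if len(record) != 3:
            continue  # skip blank or unexpected records
        date, slot, body = record
        for line in body.splitlines():
            line = line.strip()
            if not line or line == HEADER:
                continue  # drop the embedded header here instead of in Excel
            match = MEMBER_RE.match(line)
            if match is None:
                # Keep lines that do not fit the assumed layout so nothing is lost.
                parts = {"initials": "", "name": line, "joined": "", "waiver": ""}
            else:
                parts = match.groupdict()
            yield {"date": date, "slot": slot, **parts}

# A shortened copy of the second session from the sample.
sample = io.StringIO(
    '"Monday, October 26, 2020",9:00AM Until 9:50AM (50Minutes),"Name Joined Waiver\n'
    'DA FName LName 09/30/2020 07:44 AM 9/23/2020 6:06:38 PM\n'
    'HB FName LName 09/30/2020 07:44 AM Manually Signed\n'
    'NW FName LName 9/30/2020 TBA Manually Signed"\n'
)
rows = list(flatten_sessions(sample))

# Write the flattened rows back out as an ordinary one-row-per-member CSV.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With the real file, the `io.StringIO` sample would be replaced by `open("1-weak-gym.csv", newline="")`. Since the question is tagged pandas, the list of dicts in `rows` could also be passed to `pandas.DataFrame` instead of being written out with `csv.DictWriter`.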

0 answers:

No answers