Question

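My dataset (Excel) has only a single column, but that column holds too much information. I want to convert it into a table based on the timestamps, with the following columns: Time, Name, URL. I am trying to do this with Python Pandas.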

The dataset I am working with looks like this:

6/10/2017  8:40:34 AM

James

URL:.....(multiple rows)

6/10/2017 8:45:34 AM

Jenny

URL:....

How can I do this with Python Pandas?

Answer 1

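One approach is to use reshape, provided the values are in exactly the right order, i.e. every record takes exactly three rows (time, name, URL):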

import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(df.values, (len(df) // 3, 3)))
df.columns = ['Time', 'Name', 'URL']

    Time                    Name    URL
0   6/10/2017 8:40:34 AM    James   URL:.....(multi rows)
1   6/10/2017 8:45:34 AM    Jenny   URL:....
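
For anyone who wants to try the reshape idea on its own, here is a minimal self-contained sketch. The column name `col` and the example.com URLs are made up for illustration, and it only works when every record occupies exactly three consecutive rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one column, three rows per record (time, name, URL).
raw = pd.DataFrame({'col': [
    '6/10/2017 8:40:34 AM', 'James', 'URL: http://example.com/a',
    '6/10/2017 8:45:34 AM', 'Jenny', 'URL: http://example.com/b',
]})

# reshape(-1, 3) lets NumPy infer the number of records; it raises a
# ValueError if the row count is not a multiple of 3.
table = pd.DataFrame(raw['col'].to_numpy().reshape(-1, 3),
                     columns=['Time', 'Name', 'URL'])
print(table)
```

If the URL part can span a varying number of rows, the reshape will either fail or misalign the columns, so this is only suitable for strictly regular data.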

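Edit: here is another approach

Use pd.to_datetime to extract the time
Use str.contains('URL') to pick out the URL rows
Everything else is a name
Group every three rows to fill the NaN values, then drop the duplicates.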

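# Rows that do not parse as a timestamp become NaT
df['Time'] = pd.to_datetime(df['col'], errors='coerce')

# Copy the rows that contain 'URL' into a new URL column
df.loc[df['col'].str.contains('URL'), 'URL'] = df['col']

# Whatever is neither a time nor a URL is a name
df['Name'] = df[(df['Time'].isnull() & df['URL'].isnull())].col

# Fill the gaps within each block of three rows, then keep one row per block
df.drop('col', axis=1).groupby(df.index // 3).ffill().bfill().drop_duplicates()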

You get

    Time                URL                     Name
0   2017-06-10 08:40:34 URL:.....(multi rows)   James
3   2017-06-10 08:45:34 URL:....                Jenny
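
The question mentions that the URL part can span multiple rows, in which case grouping by every three rows no longer lines up. As a rough sketch of one way around that (not part of the original answer), each timestamp row can be treated as the start of a new record. The column name `col`, the sample URLs, and the month/day/year timestamp format are all assumptions here.

```python
import pandas as pd

# Hypothetical sample: the URL part of a record may span several rows.
raw = pd.DataFrame({'col': [
    '6/10/2017 8:40:34 AM', 'James',
    'URL: http://example.com/a', 'URL: http://example.com/b',
    '6/10/2017 8:45:34 AM', 'Jenny',
    'URL: http://example.com/c',
]})

# Assumed layout: month/day/year with a 12-hour clock.
# Rows that do not match the format become NaT.
time = pd.to_datetime(raw['col'], format='%m/%d/%Y %I:%M:%S %p',
                      errors='coerce')
is_time = time.notna()
is_url = raw['col'].str.startswith('URL')
is_name = ~is_time & ~is_url

# Each timestamp row starts a new record, so a running count of
# timestamp rows labels every row with its record number.
record = is_time.cumsum()

table = pd.DataFrame({
    'Time': time[is_time].groupby(record[is_time]).first(),
    'Name': raw['col'][is_name].groupby(record[is_name]).first(),
    # Collect all URL rows of a record into one list.
    'URL': raw['col'][is_url].groupby(record[is_url]).agg(list),
}).reset_index(drop=True)
print(table)
```

Keeping the URLs as a list per record is just one choice; joining them into a single string would work equally well if a flat column is preferred.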

Answer 2

Here is an approach that may help.

import pandas as pd

# Create the dataframe
df = pd.DataFrame({'time': ['6/10/2017 08:40:34 AM', '6/10/2017 08:45:34 AM'], 'name':['James', 'Jenny'], 'url':['www.yahoo.com', 'www.google.com']})

# Set the index of the dataframe to time
indexed_df = df.set_index('time')

# review the original dataframe
df
Out[11]: 
    name                   time             url
0  James  6/10/2017 08:40:34 AM   www.yahoo.com
1  Jenny  6/10/2017 08:45:34 AM  www.google.com

# check the newly indexed dataframe
indexed_df
Out[12]: 
                        name             url
time                                        
6/10/2017 08:40:34 AM  James   www.yahoo.com
6/10/2017 08:45:34 AM  Jenny  www.google.com

I hope this helps. It is taken from the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html
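
A possible refinement, sketched here under the assumption that the strings are month/day/year with a 12-hour clock: parsing the time column before calling set_index gives a DatetimeIndex rather than an index of plain strings, which allows time-based selection.

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['6/10/2017 08:40:34 AM', '6/10/2017 08:45:34 AM'],
    'name': ['James', 'Jenny'],
    'url': ['www.yahoo.com', 'www.google.com'],
})

# Assumed layout: month/day/year with a 12-hour clock.
df['time'] = pd.to_datetime(df['time'], format='%m/%d/%Y %I:%M:%S %p')
indexed_df = df.set_index('time')

# Partial-string indexing: select every row that falls within 08:40.
print(indexed_df.loc['2017-06-10 08:40'])
```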

How to extract timestamps from a column using Python Pandas
