I have a pandas DataFrame in which one column contains time intervals represented as strings such as "P1Y4M1D".
A sample of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
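I wrote a parsing function that takes a string such as 'P1Y4M1D' and returns an integer. How can I use that function to replace all of the column's values with the parsed values?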
import pandas as pd

def do_process_citation_data(f_path):
    global my_ocan
    # 'timespan' holds duration strings such as "P2Y", so it is not passed to parse_dates
    my_ocan = pd.read_csv(f_path,
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'])
    my_ocan = my_ocan.iloc[1:]  # drop the first row (the CSV header); iloc selects rows by position
    # Sample values look like "1985-04", so let pandas infer the format; unparseable values become NaT
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], errors='coerce')
    return my_ocan
import re

def parse():
    # Map each OCI to its raw timespan string, e.g. "P1Y4M1D" or "-P2Y"
    mydict = dict(zip(my_ocan['oci'], my_ocan['timespan'].astype(str)))
    mydict2 = dict()
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        # Drop the leading '-' so the same pattern handles both signs
        date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value.lstrip('-'))
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        # Approximation: 365 days per year, 30 days per month
        daystotal = (year * 365) + (month * 30) + day
        # Store the result instead of returning it, so that every row is processed
        mydict2[key] = -daystotal if is_negative else daystotal
    return mydict2
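Maybe I don't even need to replace the whole column with the parsed values. The end goal is to write a new function that returns the average ['timespan'] of the documents created in a given year. Since I need the parsed values anyway, I thought it would be easier to change the whole column and work with the new DataFrame.
I'm also curious how to apply the parsing function to each ['timespan'] row without modifying the DataFrame. I can only guess it might look like this, but I don't fully understand how to do it: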
for x in my_ocan['timespan']:
    x = parse(str(x))
How can I get a column with the new values? Thanks! Peace :)
Answer 0 (score: 1)
df['timespan'].apply(parse)
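(as @Dan mentioned) should work. You just need to modify the parse function so that it takes a string as its argument and returns the parsed value at the end. Something like this: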
import pandas as pd
def parse_postal_code(postal_code):
    # Split the postal code on '_' and keep the leading letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
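Applied to the question's data, the same idea might look like the sketch below. It is only an illustration under a few assumptions: `parse_timespan`, `mean_timespan_for_year` and the `sample` rows are hypothetical names and data, not part of the original code; the 365-day year and 30-day month are the approximation used in the question; and unparseable values fall back to 0, as they do in the question's code.

```python
import re

import pandas as pd

# Optional sign, then optional year, month and day parts, e.g. "P1Y4M1D", "P2Y", "-P3M"
DURATION_RE = re.compile(r"(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")


def parse_timespan(value):
    """Convert one duration string to an approximate number of days."""
    match = DURATION_RE.match(str(value))
    if match is None:
        return 0  # same fallback as the question's code for unparseable values
    sign, *parts = match.groups()
    years, months, days = (int(p) if p else 0 for p in parts)
    # Approximation taken from the question: 365 days per year, 30 days per month
    total = years * 365 + months * 30 + days
    return -total if sign else total


def mean_timespan_for_year(df, year):
    """Average timespan in days of the rows whose 'creation' value starts with `year`."""
    days = df["timespan"].apply(parse_timespan)  # new Series; df itself is not modified
    created_in_year = df["creation"].astype(str).str.startswith(str(year))
    return days[created_in_year].mean()


# Hypothetical sample rows in the shape of the question's CSV
sample = pd.DataFrame({
    "creation": ["1985-04", "1985-11", "1990-01"],
    "timespan": ["P2Y", "P1Y4M1D", "-P3M"],
})

# Store the parsed values in a new column, leaving the original strings intact
sample["timespan_days"] = sample["timespan"].apply(parse_timespan)
print(sample)
print(mean_timespan_for_year(sample, 1985))  # 608.0
```

Because `apply` returns a new Series, the yearly average can be computed without touching the original column. Note that the 0 fallback for missing or malformed values will pull the average toward zero, so filtering those rows out first may be preferable.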