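I have a very large data frame resembling the first four columns of the data frame below. I am trying to generate a fifth column, DataType, that gives the data type of each entry in the "Word" column.
Note that the data types here are labels such as "WEEKDAY" and "DATE", not Python data types.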
Page  LineNum  Word    Line                     DataType
1     1        Today   Today is 5th Sept 2015   NULL
1     1        is      Today is 5th Sept 2015   NULL
1     1        5th     Today is 5th Sept 2015   DATE
1     1        Sept    Today is 5th Sept 2015   DATE
1     1        2015    Today is 5th Sept 2015   DATE
...
1     4        Sunday  Sunday will be Sept 8th  WEEKDAY
1     4        will    Sunday will be Sept 8th  NULL
1     4        be      Sunday will be Sept 8th  NULL
1     4        Sept    Sunday will be Sept 8th  DATE
1     4        8th     Sunday will be Sept 8th  DATE

I have separate functions that each extract, from a string, a list of all substrings of a particular data type. Suppose isit_date returns ['5th Sept 2015'] for line 1 of page 1.

To get the DataType column I am using groupby and lambda functions. There are many pages and many lines.
I am trying to use the following code:

file_dataframe['DataType'] = 'NULL'
for name, groups in file_dataframe.groupby(['Page', 'LineNum']):
    list_of_dates = isit_date(str(groups['Line'][0]))
    groups['DataType'] = groups['Word'].apply(lambda x: "DATE" if x in list_of_dates else 'NULL')

I can see several problems. The groups are not written back to the initial data frame, and the code also raises an error. Can someone suggest a correct and efficient way to do this?
I am trying to use groupby because isit_date() takes some time to execute, and I don't want to repeat it for every row, since Line is the same for every row in a group.
I am using Python 3 and pandas. Please comment if my question needs further clarification.
Answer 0 (score: 1)
You can try a custom function:
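def isit_date(x):
    # Stub standing in for the real extractor: always returns both sample dates
    return ['5th Sept 2015', 'Sept 8th']

def f(group):
    # Split each date string and flatten all values into one list of words
    line = str(group['Line'].iat[0])
    list_of_dates = [word for date in isit_date(line) for word in date.split()]
    group['DataType'] = group['Word'].apply(lambda w: "DATE" if w in list_of_dates else 'NULL')
    return group

# group_keys=False keeps the original flat index in the result
df = file_dataframe.groupby(['Page', 'LineNum'], group_keys=False).apply(f)
print(df)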
   Page  LineNum    Word                     Line DataType
0     1        1   Today   Today is 5th Sept 2015     NULL
1     1        1      is   Today is 5th Sept 2015     NULL
2     1        1     5th   Today is 5th Sept 2015     DATE
3     1        1    Sept   Today is 5th Sept 2015     DATE
4     1        1    2015   Today is 5th Sept 2015     DATE
5     1        4  Sunday  Sunday will be Sept 8th     NULL
6     1        4    will  Sunday will be Sept 8th     NULL
7     1        4      be  Sunday will be Sept 8th     NULL
8     1        4    Sept  Sunday will be Sept 8th     DATE
9     1        4     8th  Sunday will be Sept 8th     DATE
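
For comparison, the loop from the question can also be made to work. Two things go wrong there: each group yielded by groupby is a separate sub-frame, so assigning to it never reaches file_dataframe, and groups['Line'][0] looks up the index label 0, which only the first group contains. The sketch below is one possible repair, not the asker's final code. The isit_date stub and the two sample lines are assumptions made only so the example runs.

```python
import pandas as pd

def isit_date(line):
    """Hypothetical stand-in for the asker's extractor (the real one is not shown)."""
    known_dates = ["5th Sept 2015", "Sept 8th"]
    return [d for d in known_dates if d in line]

# Sample data rebuilt from the two lines shown in the question
lines = {(1, 1): "Today is 5th Sept 2015", (1, 4): "Sunday will be Sept 8th"}
records = [(page, num, word, line)
           for (page, num), line in lines.items()
           for word in line.split()]
file_dataframe = pd.DataFrame(records, columns=["Page", "LineNum", "Word", "Line"])

file_dataframe["DataType"] = "NULL"
for _, group in file_dataframe.groupby(["Page", "LineNum"]):
    # .iat[0] is positional, so it works whatever index labels the group has
    date_words = {w for d in isit_date(str(group["Line"].iat[0])) for w in d.split()}
    labels = group["Word"].isin(date_words).map({True: "DATE", False: "NULL"})
    # Write through the original frame, addressed by the group's index labels
    file_dataframe.loc[group.index, "DataType"] = labels
```

The extractor still runs once per (Page, LineNum) group, which is what the question was aiming for.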
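Answer 1 (score: 1)
I find that apply + lambda is generally inefficient for string operations compared with a list comprehension or plain iteration. Here is an alternative: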
# define row iterator: one row per (Page, LineNum)
unique_tups = df.drop_duplicates(subset=['Page', 'LineNum']).itertuples()

# construct dictionary mapping page + line to the set of words inside dates
# (itertuples yields namedtuples, so fields are attributes, not keys)
d = {(row.Page, row.LineNum): {w for s in isit_date(str(row.Line)) for w in s.split()}
     for row in unique_tups}

# apply membership test in list comprehension
df['DataType'] = [row.Word in d[(row.Page, row.LineNum)]
                  for row in df.itertuples()]

# use Pandas for Boolean mapping, which we know is efficient in Pandas
mapper = {True: 'DATE', False: 'NULL'}
df['DataType'] = df['DataType'].map(mapper)
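
To make the dictionary idea concrete, here is a self-contained run on the two sample lines from the question. It is only a sketch: the isit_date stub is a hypothetical stand-in for the real extractor, and iterating with zip over columns instead of itertuples is my own substitution, not something this answer requires.

```python
import pandas as pd

def isit_date(line):
    """Hypothetical stand-in for the asker's extractor (the real one is not shown)."""
    known_dates = ["5th Sept 2015", "Sept 8th"]
    return [d for d in known_dates if d in line]

# Sample data rebuilt from the two lines shown in the question
lines = {(1, 1): "Today is 5th Sept 2015", (1, 4): "Sunday will be Sept 8th"}
records = [(page, num, word, line)
           for (page, num), line in lines.items()
           for word in line.split()]
df = pd.DataFrame(records, columns=["Page", "LineNum", "Word", "Line"])

# One extractor call per distinct (Page, LineNum) pair
unique_rows = df.drop_duplicates(subset=["Page", "LineNum"])
d = {(page, num): {w for s in isit_date(str(line)) for w in s.split()}
     for page, num, line in zip(unique_rows["Page"], unique_rows["LineNum"], unique_rows["Line"])}

# Plain membership test per row; set lookups are constant time on average
df["DataType"] = ["DATE" if word in d[(page, num)] else "NULL"
                  for page, num, word in zip(df["Page"], df["LineNum"], df["Word"])]
```

Here d ends up with two entries, one per line, each holding the individual words that belong to a date on that line.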
Answer 2 (score: 0)
You don't have to use groupby for this. You can do:
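# isit_date returns whole date strings, so split them into single words first
# (note: this calls isit_date once per row, not once per line)
df['isit_date'] = df.Line.map(lambda line: {w for s in isit_date(line) for w in s.split()})
df['DataType'] = df.apply(lambda row: "DATE" if row.Word in row.isit_date else "NULL", axis=1)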
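
If the per-row calls to isit_date are a concern, as the question suggests, one option that still avoids groupby is to memoize the extractor on the line text. This is a sketch under assumptions: the isit_date stub is hypothetical, and caching by text only makes sense if the extractor's result depends on nothing but the line string.

```python
from functools import lru_cache

import pandas as pd

def isit_date(line):
    """Hypothetical stand-in for the asker's extractor (the real one is not shown)."""
    known_dates = ["5th Sept 2015", "Sept 8th"]
    return [d for d in known_dates if d in line]

@lru_cache(maxsize=None)
def date_words(line):
    """Words that belong to any date found in `line`; computed once per distinct line."""
    return frozenset(w for s in isit_date(line) for w in s.split())

# Sample data rebuilt from the two lines shown in the question
lines = {(1, 1): "Today is 5th Sept 2015", (1, 4): "Sunday will be Sept 8th"}
records = [(page, num, word, line)
           for (page, num), line in lines.items()
           for word in line.split()]
df = pd.DataFrame(records, columns=["Page", "LineNum", "Word", "Line"])

df["DataType"] = ["DATE" if word in date_words(line) else "NULL"
                  for word, line in zip(df["Word"], df["Line"])]

# cache_info() reports how often the extractor actually ran
info = date_words.cache_info()
```

On this sample, info.misses is 2 (one real call per distinct line) and info.hits is 8, so the expensive function runs twice for ten rows.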