Question

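I have a very large data frame that looks like the first four columns of the data frame shown in this question. I am trying to generate a fifth column, DataType, that holds the data type of the value in the "Word" column.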

Note that "data type" here means labels such as "WEEKDAY" or "DATE", not Python data types.

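Page  LineNum  Word    Line                     DataType
1     1        Today   Today is 5th Sept 2015   NULL
1     1        is      Today is 5th Sept 2015   NULL
1     1        5th     Today is 5th Sept 2015   DATE
1     1        Sept    Today is 5th Sept 2015   DATE
1     1        2015    Today is 5th Sept 2015   DATE
...
1     4        Sunday  Sunday will be Sept 8th  WEEKDAY
1     4        will    Sunday will be Sept 8th  NULL
1     4        be      Sunday will be Sept 8th  NULL
1     4        Sept    Sunday will be Sept 8th  DATE
1     4        8th     Sunday will be Sept 8th  DATE

I have separate functions, each of which extracts from a string a list of all substrings of one particular data type. Say isit_date returns ['5th Sept 2015'] for Page 1, LineNum 1.

To get the DataType column, I am using groupby and a lambda function. There are many pages and many lines.

I am trying the following code:

file_dataframe['DataType'] = 'NULL'
for name, groups in file_dataframe.groupby(['Page', 'LineNum']):
    list_of_dates = isit_date(str(groups['Line'][0]))
    groups['DataType'] = groups['Word'].apply(lambda x: "DATE" if x in list_of_dates else 'NULL')

I can see several mistakes in it: the groups are not written back to the initial data frame, and the code also raises an error. Can someone suggest a correct and efficient way to do this?

I am trying to use groupby because the isit_date() function takes some time to execute, and I don't want to repeat the call for every row, since every row in a group shares the same Line.

I am using Python 3 and pandas. Please comment if my question needs further clarification.

Here is the isit_date code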

Answer 1

You can try a custom function:

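def isit_date(x):
    # stub standing in for the real extractor, so the example runs
    return ['5th Sept 2015', 'Sept 8th']

def f(group):
    # call isit_date once per group, then split and flatten all values to one list of words
    list_of_dates = [word for date in isit_date(str(group['Line'].iat[0])) for word in date.split()]

    group['DataType'] = group['Word'].apply(lambda w: "DATE" if w in list_of_dates else 'NULL')
    return group

# group_keys=False keeps the original flat index instead of adding the group keys to it
df = file_dataframe.groupby(['Page', 'LineNum'], group_keys=False).apply(f)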

print (df)
   Page  LineNum    Word                     Line DataType
0     1        1   Today   Today is 5th Sept 2015     NULL
1     1        1      is   Today is 5th Sept 2015     NULL
2     1        1     5th   Today is 5th Sept 2015     DATE
3     1        1    Sept   Today is 5th Sept 2015     DATE
4     1        1    2015   Today is 5th Sept 2015     DATE
5     1        4  Sunday  Sunday will be Sept 8th     NULL
6     1        4    will  Sunday will be Sept 8th     NULL
7     1        4      be  Sunday will be Sept 8th     NULL
8     1        4    Sept  Sunday will be Sept 8th     DATE
9     1        4     8th  Sunday will be Sept 8th     DATE
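For comparison, the loop from the question can also be kept and made to work. The following is a minimal sketch, not the asker's actual setup: isit_date is a hypothetical stand-in for the real extractor, and the sample frame is a shortened version of the one in the question. The two changes are reading the first Line positionally with .iat[0] and writing through file_dataframe.loc, because assigning to the group only changes a copy.

```python
import pandas as pd

def isit_date(line):
    # Hypothetical stand-in for the asker's extractor: returns the known
    # date substrings that occur in the line.
    known = ['5th Sept 2015', 'Sept 8th']
    return [d for d in known if d in line]

file_dataframe = pd.DataFrame({
    'Page': [1, 1, 1, 1, 1],
    'LineNum': [1, 1, 1, 4, 4],
    'Word': ['Today', '5th', '2015', 'Sunday', '8th'],
    'Line': ['Today is 5th Sept 2015'] * 3 + ['Sunday will be Sept 8th'] * 2,
})

file_dataframe['DataType'] = 'NULL'
for _, group in file_dataframe.groupby(['Page', 'LineNum']):
    # .iat[0] is positional; group['Line'][0] would look up index label 0,
    # which exists only in the first group.
    date_words = {w for d in isit_date(str(group['Line'].iat[0])) for w in d.split()}
    is_date = group['Word'].isin(date_words).to_numpy()
    # Write into the original frame through the group's index labels.
    file_dataframe.loc[group.index[is_date], 'DataType'] = 'DATE'

print(file_dataframe[['Word', 'DataType']])
```

isit_date still runs once per (Page, LineNum) group, which is what the question asked for.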

Answer 2

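I find that apply + lambda is often inefficient for string operations compared with a list comprehension or plain iteration. Here is an alternative: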

# define row iterator: one row per (Page, LineNum), so isit_date runs once per line
unique_tups = df.drop_duplicates(subset=['Page', 'LineNum']).itertuples()

# construct dictionary mapping page + line to the set of words that belong to a date
# (itertuples yields namedtuples, so fields are read as attributes, not with row['...'])
d = {(row.Page, row.LineNum): {word for date in isit_date(str(row.Line)) for word in date.split()}
     for row in unique_tups}

# apply membership test in list comprehension
df['DataType'] = [row.Word in d[(row.Page, row.LineNum)]
                  for row in df.itertuples()]

# use Pandas for Boolean mapping, which we know is efficient in Pandas
mapper = {True: 'DATE', False: 'NULL'}
df['DataType'] = df['DataType'].map(mapper)
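To check that the dictionary approach really calls the extractor only once per line, one can log the calls. This is a small sketch under stated assumptions: isit_date here is a hypothetical stand-in that records its arguments, call_log is a helper introduced only for this illustration, and the sample frame is shortened from the question.

```python
import pandas as pd

call_log = []  # illustration only: records every line passed to isit_date

def isit_date(line):
    # Hypothetical stand-in for the real (slow) extractor.
    call_log.append(line)
    known = ['5th Sept 2015', 'Sept 8th']
    return [d for d in known if d in line]

df = pd.DataFrame({
    'Page': [1, 1, 1, 1, 1],
    'LineNum': [1, 1, 1, 4, 4],
    'Word': ['Today', '5th', '2015', 'Sunday', '8th'],
    'Line': ['Today is 5th Sept 2015'] * 3 + ['Sunday will be Sept 8th'] * 2,
})

# One representative row per (Page, LineNum).
unique_rows = df.drop_duplicates(subset=['Page', 'LineNum'])

# Map each (Page, LineNum) to the set of words that belong to a date.
date_words = {
    (row.Page, row.LineNum): {w for d in isit_date(str(row.Line)) for w in d.split()}
    for row in unique_rows.itertuples(index=False)
}

df['DataType'] = [
    'DATE' if row.Word in date_words[(row.Page, row.LineNum)] else 'NULL'
    for row in df.itertuples(index=False)
]

print(len(call_log), 'calls for', len(df), 'rows')
```

With five rows spread over two lines, the extractor is invoked twice rather than five times.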

Answer 3

You don't have to use groupby for this. You can do:

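# note: this calls isit_date once per row, not once per line
df['isit_date'] = df.Line.map(isit_date)
# compare Word against the individual words of each extracted date
df['DataType'] = df.apply(lambda row: "DATE" if any(row.Word in date.split() for date in row.isit_date) else "NULL", axis=1)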
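If the cost of isit_date is the concern, a groupby-free variant can still avoid repeated work by caching results per distinct Line text. The sketch below uses functools.lru_cache from the standard library. It assumes that identical Line strings always yield the same dates, so caching by text is equivalent to caching by (Page, LineNum). isit_date is again a hypothetical stand-in, and date_words is a helper name introduced only here.

```python
from functools import lru_cache

import pandas as pd

def isit_date(line):
    # Hypothetical stand-in for the real (slow) extractor.
    known = ['5th Sept 2015', 'Sept 8th']
    return [d for d in known if d in line]

@lru_cache(maxsize=None)
def date_words(line):
    # Individual words that belong to a date on this line; computed once
    # per distinct line text and then served from the cache.
    return frozenset(w for d in isit_date(line) for w in d.split())

df = pd.DataFrame({
    'Page': [1, 1, 1, 1, 1],
    'LineNum': [1, 1, 1, 4, 4],
    'Word': ['Today', '5th', '2015', 'Sunday', '8th'],
    'Line': ['Today is 5th Sept 2015'] * 3 + ['Sunday will be Sept 8th'] * 2,
})

df['DataType'] = [
    'DATE' if word in date_words(line) else 'NULL'
    for word, line in zip(df['Word'], df['Line'])
]

print(df[['Word', 'DataType']])
print(date_words.cache_info())
```

The cache grows with the number of distinct lines, so for very large inputs a bounded maxsize may be a better trade-off.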

Create a separate column in a data frame after iterating over a groupby
