I have a pandas DataFrame like this:
ID1  ID2  Len1  Date1   Type1  Len2  Date2   Type2  Len_Diff  Date_Diff  Score
123  456        1-Apr   M            6-Apr   L
234  567        20-Apr  S            19-Apr  S
345  678        10-Apr  M            1-Jan   M
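I want to fill in the Len1, Len2, Len_Diff and Date_Diff columns by computing them from the data set. Each ID corresponds to a text file whose text can be retrieved with the get_text function, and the length of that text can then be computed.
So far, my code does this for each column separately: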
def len_text(key):
    text = get_text(key)
    return len(text)

df['Len1'] = df['ID1'].map(len_text)
df['Len2'] = df['ID2'].map(len_text)
df['Len_Diff'] = (abs(df['Len1'] - df['Len2']))
df['Date_Diff'] = (abs(df['Date1'] - df['Date2']))
df['Same_Type'] = np.where(df['Type1']==df['Type2'],1,0)
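How can I add all of these columns to the DataFrame in one step? I want to compute them in a single step because I want to wrap the code in a try/except block to handle the ValueError raised when decoding the text fails.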
try:
    <code to add all five columns at once>
except ValueError:
    print("Failed to decode")
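Adding a try/except block around every line above would make the code ugly. There are other questions, such as "Changing certain values in multiple columns of a pandas DataFrame at once", that deal with multiple columns, but they all discuss a single calculation or change that affects several columns. What I want is a different calculation for each added column.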
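One way to end up with a single try/except is to move the five assignments into a helper and wrap only the call. The following is a minimal sketch, not a definitive solution: `get_text` here is a stand-in (the real function is not shown in the question), `add_computed_columns` is a hypothetical helper name, and the dates are assumed to be already parsed into datetimes.

```python
import numpy as np
import pandas as pd

# Stand-in for the real get_text, which the question does not show.
FAKE_TEXTS = {123: "abc", 456: "abcdef", 234: "x", 567: "xyz"}

def get_text(key):
    if key not in FAKE_TEXTS:
        raise ValueError("cannot decode text for %r" % key)
    return FAKE_TEXTS[key]

def len_text(key):
    return len(get_text(key))

def add_computed_columns(frame):
    """Hypothetical helper: return a copy of frame with the computed columns."""
    out = frame.copy()
    out["Len1"] = out["ID1"].map(len_text)
    out["Len2"] = out["ID2"].map(len_text)
    out["Len_Diff"] = (out["Len1"] - out["Len2"]).abs()
    out["Date_Diff"] = (out["Date1"] - out["Date2"]).abs()
    out["Same_Type"] = np.where(out["Type1"] == out["Type2"], 1, 0)
    return out

# Assumed sample data; the year is made up so the dates can be parsed.
df = pd.DataFrame({
    "ID1": [123, 234],
    "ID2": [456, 567],
    "Date1": pd.to_datetime(["2016-04-01", "2016-04-20"]),
    "Date2": pd.to_datetime(["2016-04-06", "2016-04-19"]),
    "Type1": ["M", "S"],
    "Type2": ["L", "S"],
})

try:
    df = add_computed_columns(df)
except ValueError:
    print("Failed to decode")
```

Because the helper works on a copy, a failure part-way through leaves the original `df` without any half-filled columns.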
Update: Based on the answers given below, I tried two different approaches, with only partial success so far. This is what I did:
Approach 1:
# Add calculated columns Len1, Len2, Len_Diff, Date_Diff and Same_Type
def len_text(key):
    try:
        text = get_text(key)
        return len(text)
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError, requests.exceptions.Timeout, ValueError) as e:
        return 0

df.loc[:, ['Len1','Len2','Len_Diff','Date_Diff','Same_Type']] = pd.DataFrame([
    df['ID1'].map(len_text),
    df['ID2'].map(len_text),
    np.abs(df['ID1'].map(len_text) - df['ID2'].map(len_text)),
    np.abs(df['Date1']- df['Date2']),
    np.where(df['Type1']==df['Type2'],1,0)
])
print(df.info())
Result 1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Data columns (total 10 columns):
ID1          570 non-null int64
Date1        570 non-null int64
Type1        566 non-null object
Len1         0 non-null float64
ID2          570 non-null int64
Date2        570 non-null int64
Type2        570 non-null object
Len2         0 non-null float64
Date_Diff    0 non-null float64
Len_Diff     0 non-null float64
dtypes: float64(4), int64(4), object(2)
memory usage: 58.0+ KB
None
Approach 2:
def len_text(col):
    try:
        return col.map(get_text).str.len()
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError, requests.exceptions.Timeout, ValueError) as e:
        return 0

formulas = """
Len1 = @len_text(ID1)
Len2 = @len_text(ID2)
Len_Diff = Len1 - Len2
Len_Diff = Len_Diff.abs()
Same_Type = (Type1 == Type2) * 1
"""
try:
    df.eval(formulas, inplace=True, engine='python')
except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError, requests.exceptions.Timeout, ValueError) as e:
    print(e)
print(df.info())
Result 2:
"__pd_eval_local_len_text" is not a supported function
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Data columns (total 7 columns):
ID1      570 non-null int64
Date1    570 non-null int64
Type1    566 non-null object
ID2      570 non-null int64
Date2    570 non-null int64
Type2    570 non-null object
Len1     570 non-null int64
dtypes: int64(5), object(2)
memory usage: 31.2+ KB
None
/Users/.../anaconda2/lib/python2.7/site-packages/pandas/computation/eval.py:289: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  target[parsed_expr.assigner] = ret
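Answer 0 (score: 3)
Something like this should do the job.
Edit 2: This is actually a rather nasty way to do it in a single assignment, because it evaluates Len1 and Len2 more than once.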
df.loc[:, ['Len1', 'Len2', 'Len_Diff', 'Date_Diff', 'Same_Type']] = \
    pd.DataFrame([
        df['ID1'].map(len_text),
        df['ID2'].map(len_text),
        np.abs(df['ID1'].map(len_text) - df['ID2'].map(len_text)),
        np.abs(df['Date1'] - df['Date2']),
        np.where(df['Type1']==df['Type2'],1,0)
    ])
However, it is far less readable than the original version.
Edit: Here is a better way, split into two statements.
df.loc[:, ['Len1', 'Len2']] = \
    pd.DataFrame([
        df['ID1'].map(len_text),
        df['ID2'].map(len_text)
    ])
df.loc[:, ['Len_Diff', 'Date_Diff', 'Same_Type']] = \
    pd.DataFrame([
        np.abs(df['Len1'] - df['Len2']),
        np.abs(df['Date1'] - df['Date2']),
        np.where(df['Type1']==df['Type2'],1,0)
    ])
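A different way to add several computed columns in one expression, not taken from this answer, is `DataFrame.assign`. With pandas 0.23 or later on Python 3.6+, later keyword arguments may be callables that refer to columns created earlier in the same call, so Len1 and Len2 are each evaluated only once. A minimal sketch, assuming a stand-in `len_text` and made-up sample data:

```python
import pandas as pd

def len_text(key):
    # Stand-in: pretend the text length equals the number of digits in the ID.
    return len(str(key))

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "ID1": [123, 2345],
    "ID2": [45, 567],
    "Type1": ["M", "S"],
    "Type2": ["L", "S"],
})

# Each lambda receives the frame as built so far, so Len_Diff can use
# the Len1 and Len2 columns created just above it.
df = df.assign(
    Len1=lambda d: d["ID1"].map(len_text),
    Len2=lambda d: d["ID2"].map(len_text),
    Len_Diff=lambda d: (d["Len1"] - d["Len2"]).abs(),
    Same_Type=lambda d: (d["Type1"] == d["Type2"]).astype(int),
)
```

`assign` returns a new DataFrame, so wrapping this single statement in try/except leaves the original frame unchanged if any of the computations raises.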
Answer 1 (score: 2)
Here is an example of how you could do this:
>>> df
      a  b     c
0  None  1  None
1  None  2  None
2  None  3  None
3  None  4  None
>>> def f(val):
...     return random.randint(1,10)
...
>>> df.loc[:,['a','c']] = df[['a','c']].applymap(f)
>>> df
    a  b   c
0   3  1   7
1  10  2  10
2   6  3   4
3   4  4   8
So, in your case:
df.loc[:,['Len1', 'Len2']] = df[['ID1','ID2']].applymap(len_text)
Frankly, though, you may be better off with the ugly version, because then you will know which column is giving you an error.
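If knowing which column failed matters, a loop over (target, source) pairs keeps one try/except in the code while still reporting the column. This is only a sketch: `get_text` is a stand-in that deliberately fails for one made-up ID, and the sample frame is invented for illustration.

```python
import pandas as pd

def get_text(key):
    # Stand-in: ID 567 simulates a text that fails to decode.
    if key == 567:
        raise ValueError("cannot decode")
    return "x" * (key % 10)

def len_text(key):
    return len(get_text(key))

df = pd.DataFrame({"ID1": [123, 234], "ID2": [456, 567]})

failed = []  # names of target columns that could not be computed
for target, source in [("Len1", "ID1"), ("Len2", "ID2")]:
    try:
        df[target] = df[source].map(len_text)
    except ValueError as exc:
        failed.append(target)
        print("Failed to decode while computing %s: %s" % (target, exc))
```

Here Len1 is filled in, while Len2 is skipped and recorded in `failed`, so the remaining columns can be handled or reported separately.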
Answer 2 (score: 2)
You can use the DataFrame.eval() method:
In [254]: x
Out[254]:
   ID1  ID2   Date1 Type1   Date2 Type2
0  123  456   1-Apr     M   6-Apr     L
1  234  567  20-Apr     S  19-Apr     S
2  345  678  10-Apr     M   1-Jan     M

In [255]: formulas = """
     ...: Len1 = @len_text(ID1)
     ...: Len2 = @len_text(ID2)
     ...: Len_Diff = Len1 - Len2
     ...: Len_Diff = Len_Diff.abs()
     ...: Same_Type = (Type1 == Type2) * 1
     ...: """
     ...:

In [256]: x.eval(formulas, inplace=False, engine='python')
Out[256]:
   ID1  ID2   Date1 Type1   Date2 Type2  Len1  Len2  Len_Diff  Same_Type
0  123  456   1-Apr     M   6-Apr     L     3     3         0          0
1  234  567  20-Apr     S  19-Apr     S     3     3         0          1
2  345  678  10-Apr     M   1-Jan     M     3     3         0          1
PS: This solution assumes that the len_text() function can accept a whole column (a pandas.Series). For example:
def len_text(col):
    return col.map(get_text).str.len()
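If a single undecodable text should not abort the whole column, the column-level function could catch the error per element instead. The sketch below assumes a stand-in `get_text` that fails for one made-up ID; `safe_get_text` is a hypothetical helper that returns None on failure, so the corresponding length comes out as a missing value rather than 0.

```python
import pandas as pd

def get_text(key):
    # Stand-in: ID 234 simulates a text that fails to decode.
    if key == 234:
        raise ValueError("cannot decode")
    return "t" * (key % 10)

def safe_get_text(key):
    # Hypothetical helper: swallow the decoding error for one element only.
    try:
        return get_text(key)
    except ValueError:
        return None

def len_text(col):
    # .str.len() yields a missing value wherever the text is None.
    return col.map(safe_get_text).str.len()

lengths = len_text(pd.Series([123, 234, 345]))
```

A missing value keeps failed rows distinguishable from genuinely empty texts, which a fallback of 0 would not.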