Suppose I have the following DataFrame:
      xx    yy    tt
0   2.80  1.00  1.00
1  85.00  4.48  6.50
2   2.10  8.00  1.00
3   8.00  1.00  0.00
4   9.00  2.54  1.64
5   5.55  7.25  3.15
6   1.66  0.00  4.00
7   3.00  7.11  1.98
8   1.00  0.00  4.65
9   1.87  2.33  0.00
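I want to use it in a for loop that goes through every point in df and computes the Euclidean distance to all the other points. For example, the loop visits point a and gets the distances from a to points b, c, d ... n. Then it moves on to point b and gets the distances to points a, c, d ... n, and so on.
Once I have the distances, I want a value_counts() of the distance values. To save memory, though, I can't just call value_counts() on all the results of this for loop at once, because my actual df is too big and I would end up running out of memory.
So I want to run value_counts() on the distance vector of the first point, which gives a two-column DataFrame holding the values and their respective counts. Then, when iterating over point b and getting all of its distances, I want to compare the new values with the value_counts() df from the previous iteration and check whether any values are repeated. If so, I want to += the counter for the repeated values; if not, I want to append() all the rows with no repeated values to the distances df.
This is what I've got so far: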
import numpy as np
import pandas as pd

counts = pd.DataFrame()
for index, row in df.iterrows():
    # Create a vector containing all the distances from each point to the others
    dist = pd.Series(np.sqrt((row.xx - df.xx)**2 + (row.yy - df.yy)**2 + (row.tt - df.tt)**2))
    # Get a counter for every value in the distances vector
    counter = pd.Series(dist.value_counts(sort = True)).reset_index().rename(columns = {'index': 'values', 0:'counts'})
    if index in counter['values']:
        # Check if the new values are in the counter df, if so, add +1 to each repeated value
        counter['counts'][index] += 1
    else:
        # If no repeated values, then append new rows to the counter df
        counts = counts.append((index,row))
The expected result would be something like this:
# These are the value counts for point a and its distances:
values counts
0 0.000000 644589
1 0.005395 1
2 0.005752 1
3 0.016710 1
4 0.023043 1
5 0.012942 1
6 0.020562 1
Now, iterating over point b:
values counts
0 0.000000 644595 # Value repeated 6 times, so add +6 to the counter
1 0.005395 1
2 0.005752 1
3 0.016710 3 # Value repeated twice, so add +2 to the counter
4 0.023043 1
5 0.012942 1
6 0.020562 1
7 0.025080 1 # New value, so append a new row with value and counter
8 0.022467 1 # New value, so append a new row with value and counter
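However, if you add print(counts) at the end of the loop to check what it is producing, you will see an empty DataFrame. That is why I'm asking this question: why does this code give an empty df, and how can I make it work the way I want?
If you need more explanation, if something is unclear, or if you need more information, don't hesitate to ask.
Thanks in advance.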
Answer 0 (score: 1)
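If I understand you correctly, you want the number of occurrences of each distance value.
So I suggest you create a dictionary: the keys are the distance values, and the value stored under each key is its count: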
data = """
xx yy tt
2.8 1.0 1.0
85.0 4.48 6.5
2.1 8.0 1.0
8.0 1.0 0.0
9.0 2.54 1.64
5.55 7.25 3.15
1.66 0.0 4.0
3.0 7.11 1.98
1.0 0.0 4.65
1.87 2.33 0.0
"""
import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(data), sep=r'\s+')
dico = {}  # maps each distance value to the number of times it occurs
for index, row in df.iterrows():
    # Vector of distances from the current point to every point (itself included)
    dist = np.sqrt((row.xx - df.xx) ** 2 + (row.yy - df.yy) ** 2 +
                   (row.tt - df.tt) ** 2)
    for f in dist:  # iterate over the distances
        if f in dico:  # has this distance been seen already?
            dico[f] += 1  # yes: increment its count
        else:
            dico[f] = 1  # no: create the key for the new distance with count 1
print(dico)
Output:
{0.0: 10,
82.45726408267497: 2,
7.034912934784623: 2,
5.295280917949491: 2,
6.4203738208923635: 2,
7.158735921934822: 2,
3.361487765856065: 2,
6.191324575565393: 2,
4.190763653560053: 2,
1.9062528688503002: 2,
83.15678204452118: 2,
77.35218419669867: 2,
76.17993961667337: 2,
79.56882492534372: 2,
:
:
7.511863949779708: 2,
0.9263368717696604: 2,
4.633896848226123: 2,
7.853725230742415: 2,
5.295819105671946: 2,
5.273357564208974: 2}
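Every value has a count of at least 2, because the loop visits every ordered pair of points, and the distance from point0 to point1 equals the distance from point1 to point0.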