I am working through the following example: https://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html
Here is the code snippet:
import recordlinkage
from recordlinkage.datasets import load_febrl1
dfA = load_febrl1()
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='given_name')
candidate_links = indexer.index(dfA)
compare_cl = recordlinkage.Compare()
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
features = compare_cl.compute(candidate_links, dfA)
matches = features[features.sum(axis=1) > 3]
print(len(matches))
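I now want to print just the record_ids of the matched records. I tried listing the column names of matches, but record_id is not among them, and I cannot figure out how to get at it (I only want the individual record_ids).
Is there a way to retrieve the record_ids, either to print them on their own or to store them as a list or array?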
Answer 0: (score: 1)
Don't forget that a Pandas data frame has an "index" in addition to its data columns. Usually this is a single "extra" column of integers or strings, but more complex indices are possible, e.g. a "multi-index" consisting of more than one column.
You can see this if you run print(matches.head()). The first two columns have names that are slightly offset, because they aren't data columns; they are columns in the index itself. This data frame index is in fact a multi-index containing two columns: rec_id_1 and rec_id_2.
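As a minimal sketch of that distinction, the snippet below builds a small stand-in for matches using pandas alone. The record IDs, the name toy_matches, and the comparison values are made up for illustration; only the index level names rec_id_1 and rec_id_2 come from the discussion above.

```python
import pandas as pd

# Hypothetical record-ID pairs standing in for the candidate links.
pairs = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)

# Toy comparison features; the real frame comes from Compare.compute.
toy_matches = pd.DataFrame(
    {"given_name": [1, 1], "surname": [1.0, 1.0], "suburb": [1, 0]},
    index=pairs,
)

print(list(toy_matches.columns))      # data columns only, no record IDs
print(list(toy_matches.index.names))  # the record IDs live in the index levels
print(toy_matches.index.nlevels)      # number of index levels
```

This is why listing the columns never shows the record IDs: they are stored in the index, not in the data.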
The result from load_febrl1 encodes the record ID as the index of dfA. Compare.compute preserves the indices of the input data: you can always expect the indices from the original data to be preserved as a multi-index.
The index of a data frame by itself can be accessed with the DataFrame.index attribute. This returns an Index object (of which MultiIndex is a subclass) that can in turn be converted as follows:
Index.tolist(): convert to a list of its elements; a MultiIndex becomes a list of tuples
Index.to_series(): convert to a Series of its elements; a MultiIndex becomes a Series of tuples
Index.values: access the underlying data as a NumPy ndarray; a MultiIndex becomes an ndarray of tuples
Index.to_frame(): convert to a DataFrame, with index columns as data frame columns
So you can quickly access the record IDs with matches.index, or export them to a list with matches.index.tolist().
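The conversions listed above can be sketched on a toy frame as follows. The record IDs and the score column are invented for illustration. The last two steps use MultiIndex.get_level_values, which is not mentioned in the answer but is one way to obtain individual record IDs rather than pairs.

```python
import pandas as pd

# Hypothetical matched pairs; in the real code this index comes from matches.
pairs = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)
toy_matches = pd.DataFrame({"score": [4.0, 5.0]}, index=pairs)

pair_list = toy_matches.index.tolist()                # list of (id_1, id_2) tuples
pair_array = toy_matches.index.values                 # object ndarray of tuples
pair_frame = toy_matches.index.to_frame(index=False)  # one column per index level

# Individual record IDs from a single level of the multi-index.
left_ids = toy_matches.index.get_level_values("rec_id_1").tolist()

# Every distinct record ID that takes part in any matched pair.
all_ids = sorted(
    set(left_ids) | set(toy_matches.index.get_level_values("rec_id_2"))
)

print(pair_list)
print(left_ids)
print(all_ids)
```

Which form is most convenient depends on the next step: tuples keep each pair together, while the per-level values give a flat list of IDs.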
You can also use matches.reset_index() to turn Index columns back into regular data columns.
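A small sketch of what reset_index does to a two-level index, again on made-up data. It also shows the fallback names pandas uses when the index levels have no names, which is an easy source of confusion.

```python
import pandas as pd

# Hypothetical pairs with named index levels, as in the matches frame.
named = pd.DataFrame(
    {"score": [4.0]},
    index=pd.MultiIndex.from_tuples(
        [("rec-1-org", "rec-1-dup-0")], names=["rec_id_1", "rec_id_2"]
    ),
)
named_flat = named.reset_index()
print(list(named_flat.columns))  # level names become column names

# The same data with unnamed levels: pandas falls back to level_0, level_1.
unnamed = pd.DataFrame(
    {"score": [4.0]},
    index=pd.MultiIndex.from_tuples([("rec-1-org", "rec-1-dup-0")]),
)
unnamed_flat = unnamed.reset_index()
print(list(unnamed_flat.columns))
```

After the reset, the record IDs can be selected like any other column, for example named_flat["rec_id_1"].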
Answer 1: (score: 0)
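Here is code that completes the answer above, using the index and reset_index together with a pandas merge.
This converts the multi-index into regular columns named after its levels, here rec_id_1 and rec_id_2 (unnamed levels would become level_0 and level_1):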
matches = matches.reset_index()
We can see that the rec_id_1 column holds the same kind of values as the index of dfA:
matches.columns
dfA.index
Now merge it with dfA, matching the rec_id_1 column against the index of dfA:
import pandas as pd
matched_dfA = pd.merge(matches, dfA, left_on='rec_id_1', right_index=True)
Check the result:
matched_dfA.head()
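The merge above attaches only the first record of each pair. As a hedged sketch, the same pattern can be applied a second time on rec_id_2 to pull in both records side by side. The frames toy_dfA and toy_matches and all their values are invented for illustration, and the suffixes _1 and _2 are a naming choice made here, not something the library dictates.

```python
import pandas as pd

# Hypothetical stand-in for dfA: record IDs in the index, attributes as columns.
toy_dfA = pd.DataFrame(
    {"given_name": ["anna", "ana", "ben"], "surname": ["smith", "smith", "jones"]},
    index=pd.Index(["rec-1-org", "rec-1-dup-0", "rec-2-org"], name="rec_id"),
)

# Hypothetical stand-in for matches: one matched pair with a toy score.
toy_matches = pd.DataFrame(
    {"score": [4.0]},
    index=pd.MultiIndex.from_tuples(
        [("rec-1-org", "rec-1-dup-0")], names=["rec_id_1", "rec_id_2"]
    ),
)

flat = toy_matches.reset_index()

# First merge: attributes of the record identified by rec_id_1.
left = flat.merge(toy_dfA, left_on="rec_id_1", right_index=True)

# Second merge: attributes of the record identified by rec_id_2.
# Overlapping column names are disambiguated with the chosen suffixes.
both = left.merge(
    toy_dfA, left_on="rec_id_2", right_index=True, suffixes=("_1", "_2")
)

print(both[["rec_id_1", "rec_id_2", "given_name_1", "given_name_2"]])
```

Each row of both then describes one matched pair together with the attributes of its two records, which makes it easier to check by eye whether a match is plausible.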