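I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Some of its documents belong to more than one newsgroup. I want to treat each such document as separate entities, one per newsgroup it belongs to. To do that, I put the document IDs and group numbers into a DataFrame.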
import os
import pandas as pd
from sklearn import datasets

data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)
# The file name (last path component) serves as the document ID
keys = []
for path in filepaths:
    keys.append(os.path.split(path)[1])
groups = pd.DataFrame(keys, columns=['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID that has more than one group number assigned, I want to change its value. For example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that does not appear in the keys list.
How can I do this?
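Answer 0 (score: 1)
You can use the duplicated method to find every row whose Document_ID repeats an earlier one, that is, every occurrence after the first. Then create a list of new IDs starting at one more than the current maximum ID, and use the loc indexing operator to overwrite the duplicated keys with the new IDs.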
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Confirm that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
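The steps described in this answer can also be tried without downloading the dataset. The following is only a minimal sketch on a made-up four-row frame: the helper name reassign_duplicate_ids and the toy values are assumptions for illustration, not part of the original answer. Note that the new IDs are sequential rather than random, which is enough to guarantee they are unused.

```python
import pandas as pd


def reassign_duplicate_ids(df, id_col="Document_ID"):
    """Return a copy of df where every repeat of an ID (after its first
    occurrence) is replaced by a fresh integer ID above the current maximum.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    out[id_col] = out[id_col].astype(int)
    # True for every row whose ID already appeared in an earlier row
    dupes = out[id_col].duplicated(keep="first")
    start = int(out[id_col].max()) + 1
    # One new, previously unused ID per duplicated row
    out.loc[dupes, id_col] = list(range(start, start + int(dupes.sum())))
    return out


# Toy data (assumed values): document 76139 appears in groups 6 and 17
toy = pd.DataFrame({
    "Document_ID": ["102994", "76139", "51861", "76139"],
    "Group": [7, 6, 4, 17],
})
fixed = reassign_duplicate_ids(toy)
print(fixed)
```

Working on a copy keeps the original frame intact, which makes it easy to compare the IDs before and after.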
Answer 1 (score: 0)
Kinda hacky, but why not!
import numpy as np
import pandas as pd

data = {"Document_ID": [102994, 51861, 51879, 38242, 60880, 76139, 76139],
        "Group": [7, 1, 3, 4, 4, 6, 17],
        }
groups = pd.DataFrame(data)
# Create a list of unique IDs
DocList = groups['Document_ID'].unique().tolist()
# Build a dictionary that collects every group number under its document ID
DocDict = {}
for x in DocList:
    DocDict[x] = []
for index, row in groups.iterrows():
    DocDict[row['Document_ID']].append(row['Group'])
# For every document ID with multiple entries, create a new ID with the group number as the decimal part.
# Dividing by 100 (not 10) keeps two-digit group numbers such as 17 in the fractional part.
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"] / 100, groups["Document_ID"])
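The same idea can probably be written without the dictionary and the iterrows loop. This is a rough sketch, not a tested drop-in replacement: it uses duplicated(keep=False) to flag every row whose ID occurs more than once, the column name New_ID is made up for illustration, and the divisor of 100 is an assumption that group numbers stay below 100.

```python
import numpy as np
import pandas as pd

groups = pd.DataFrame({
    "Document_ID": [102994, 51861, 51879, 38242, 60880, 76139, 76139],
    "Group": [7, 1, 3, 4, 4, 6, 17],
})

# True for every row whose Document_ID occurs more than once (all occurrences)
shared = groups["Document_ID"].duplicated(keep=False)

# Hypothetical column name. Shared IDs get the group number as a decimal part;
# the divisor 100 assumes group numbers are below 100.
groups["New_ID"] = np.where(
    shared,
    groups["Document_ID"] + groups["Group"] / 100,
    groups["Document_ID"],
)
print(groups)
```

Writing to a separate column keeps the original Document_ID available, at the cost of turning the new IDs into floats.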
Hope that helps...