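I'm working on a project in which I have to combine multiple datasets, i.e. merge lists of semi-similar words. Each list holds words that fit one particular sense of a headword. Here is some real data: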
home[a]: abode|domicile|dwelling|dwelling place|habitation|house|pad|residence
home[a]: abode|element|environment|habitat|habitation|haunt|home ground|range|stamping ground|territory
home[a]: at ease|comfortable|familiar|relaxed
home[a]: available|in|present
home[a]: birthplace|family|fireside|hearth|homestead|home town|household
home[a]: central|domestic|familiar|family|household|inland|internal|local|national|native
home[a]: conversant with|familiar with|knowledgeable|proficient|skilled|well-versed
home[a]: drive home|emphasize|impress upon|make clear|press home
home[a]: entertaining|giving a party|having guests|receiving
home[a]: party|reception|soirée
home[b]: abode|diggings|domicile|dwelling|fireside|habitation|hearth|hearthstone|house|lodging|pad|place|quarters|residence|roof
home[b]: cradle|birthplace|mother country|motherland|-|hometown|country|nativity|old country|roots
home[b]: extended family|household|house|ménage|-|blood|folks|kin|kindred|kinfolk|kinfolks|kinsfolk|kith|brood|nuclear family|clan|community
home[b]: fatherland|country|homeland|mother country|motherland|sod|-|old country|community|neighborhood
home[b]: habitat|niche|range|territory|-|element|environment|environs|haunt|locality|milieu|neighborhood|setting|surroundings
home[c]: abode|habitat|natural habitat|environment|natural element|natural territory|original habitation|home ground|stamping ground|haunt|domain
home[c]: domestic|internal|interior|local|national|native
home[c]: family|family background|family circle|household
home[c]: home town|birthplace|homeland|native land|fatherland|motherland|mother country|country of origin
home[c]: homegrown|homemade|homespun
home[c]: house|abode|domicile|residence|dwelling|dwelling place|habitation
home[c]: house|apartment|condominium|bungalow|cottage
home[c]: residential home|institution|shelter|refuge|hostel|hospice|retirement home|nursing home|rest home|convalescent home|convalescent hospital|children's home
home[d]: domestic|internal|local|national|interior
home[d]: homemade|homegrown|family
home[d]: institution|nursing home|retirement home|rest home|children's home|hospice|shelter|refuge|retreat|asylum|hostel|halfway house
home[d]: origin|source|cradle|fount|fountainhead
home[d]: residence|place of residence|house|apartment|flat|bungalow|cottage|accommodations|property|quarters|rooms|lodgings|a roof over one's head|address|place
home[e]: domestic|internal|local|national|interior
home[e]: homemade|homegrown|family
home[e]: institution|nursing home|retirement home|rest home|children's home|hospice|shelter|refuge|retreat|asylum|hostel|halfway house
home[e]: origin|source|cradle|fount|fountainhead
home[e]: residence|place of residence|house|apartment|flat|bungalow|cottage|accommodations|property|quarters|rooms|lodgings|a roof over one's head|address|place|pad|digs|hearth|nest|domicile|abode|dwelling|dwelling place|habitation
home[f]: home base or origin birthplace|native land|fatherland|habitat
home[f]: residence house|domicile|abode|habitation|dwelling|quarters|place|castle|lodging|roost|diggings|pad
The syntax itself is not important.
Notice that some of the lists are very similar. For example, these two:
home[c]: homegrown|homemade|homespun
home[d]: homemade|homegrown|family
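The goal is to combine lists that are substantially similar. I have already solved the stemming problem (e.g. matching "dwell" and "dwelling"), and I need either a large percentage match (e.g. 50%) or a minimum number of matching words (e.g. 3).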
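To make the matching rule concrete, here is a minimal Python sketch of it. Several things here are assumptions rather than part of the actual project: the function names `parse_line` and `is_match` are made up for illustration, the percentage is measured against the smaller of the two lists, the lone `-` entries in the data are treated as separators and dropped, and words are compared by exact match after lowercasing (no stemming).

```python
def parse_line(line):
    """Turn 'home[c]: a|b|c' into a set of lowercased words.

    Assumption: a lone '-' entry is a separator, not a word, so it is dropped.
    """
    _, _, body = line.partition(":")
    words = (w.strip().lower() for w in body.split("|"))
    return {w for w in words if w and w != "-"}


def is_match(a, b, min_ratio=0.5, min_shared=3):
    """Return True if two word sets are similar enough to merge.

    Two sets match when they share at least `min_shared` words, or when the
    shared words make up at least `min_ratio` of the smaller set.
    Assumption: the question does not say what the percentage is relative to.
    """
    shared = len(a & b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return False
    return shared >= min_shared or shared / smaller >= min_ratio


c = parse_line("home[c]: homegrown|homemade|homespun")
d = parse_line("home[d]: homemade|homegrown|family")
print(is_match(c, d))  # True: 2 shared words out of 3
```

Measuring against the smaller list makes short lists merge easily, which may or may not be desirable; measuring against the union (Jaccard similarity) would be a stricter alternative.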
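However, I keep running into problems with the algorithms I design. The biggest problem is that large lists tend to swallow other lists because of a snowball effect: I end up with 8 lists, one of which has more than 150 items.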
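The exact algorithm is not shown here, so the following Python sketch is only a guess at how this kind of snowballing can arise. It assumes a greedy pass that compares each new list against the growing union of a group; the names `overlaps` and `greedy_merge` and the toy data are hypothetical, and the thresholds are the example values of 50% and 3.

```python
def overlaps(a, b, min_ratio=0.5, min_shared=3):
    """True if the sets share >= min_shared words or >= min_ratio of the smaller set."""
    shared = len(a & b)
    smaller = min(len(a), len(b))
    return smaller > 0 and (shared >= min_shared or shared / smaller >= min_ratio)


def greedy_merge(lists):
    """Merge each list into the first group whose accumulated union matches it."""
    groups = []
    for words in lists:
        for group in groups:
            if overlaps(group, words):
                group |= words  # the group grows, so later lists match it more easily
                break
        else:
            groups.append(set(words))
    return groups


# Toy data (hypothetical): each list shares half its words with the next one only.
a = {"a", "b", "c", "d"}
b = {"c", "d", "e", "f"}
c = {"e", "f", "g", "h"}
d = {"g", "h", "i", "j"}

groups = greedy_merge([a, b, c, d])
print(len(groups), len(groups[0]))  # 1 10
print(overlaps(a, d))               # False: a and d share nothing
```

In this toy run the first and last lists have no words in common, yet they land in the same group because each merge enlarges the set that the next list is compared against. Comparing a candidate against every original member of a group, rather than against the union, is one possible way to limit this chaining.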
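I'd appreciate any ideas on how to merge these. Intuitively, I think these 36 lists should reduce to about 10. Perhaps a physical attraction model or multiple passes would work. Thoughts?
Pseudocode or suggested algorithms are welcome. I mostly code in Ruby, but any language is fine. I added the 'sort' tag because I wonder whether a sorting-style pass would help.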