How would I return the differences between these two dataframes? Is there a way to compensate for typos?

Time: 2019-07-01 02:04:21

Tags: python pandas dataframe

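So, the goal of this project is to scrape the results of a Top 100 list, query a database to see whether it contains those titles, and return the information for all Top 100 songs that the database does not contain. The datasets are as follows: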

top 100 = {{"1": {"artist": "Lil Nas X Featuring Billy Ray Cyrus", "rank": 1, "title": "Old Town Road"}, "2": {"artist": "Taylor Swift", "rank": 2, "title": "You Need To Calm Down"}, "3": {"artist": "Billie Eilish", "rank": 3, "title": "Bad Guy"}, "4": {"artist": "Khalid", "rank": 4, "title": "Talk"}, "5": {"artist": "Ed Sheeran & Justin Bieber", "rank": 5, "title": "I Don't Care"}, "6": {"artist": "Jonas Brothers", "rank": 6, "title": "Sucker"}, "7": {"artist": "Drake Featuring Rick Ross", "rank": 7, "title": "Money In The Grave"}, "8": {"artist": "Post Malone", "rank": 8, "title": "Wow."}, "9": {"artist": "DaBaby", "rank": 9, "title": "Suge"}, "10": {"artist": "Chris Brown Featuring Drake", "rank": 10, "title": "No Guidance"}, "11": {"artist": "Post Malone & Swae Lee", "rank": 11, "title": "Sunflower (Spider-Man: Into The Spider-Verse)"}, "12": {"artist": "Sam Smith & Normani", "rank": 12, "title": "Dancing With A Stranger"}, "13": {"artist": "Shawn Mendes", "rank": 13, "title": "If I Can't Have You"}, "14": {"artist": "Lizzo", "rank": 14, "title": "Truth Hurts"}, "15": {"artist": "Taylor Swift Featuring Brendon Urie", "rank": 15, "title": "ME!"}, "16": {"artist": "Halsey", "rank": 16, "title": "Without Me"}, "17": {"artist": "Ava Max", "rank": 17, "title": "Sweet But Psycho"}, "18": {"artist": "Polo G Featuring Lil Tjay", "rank": 18, "title": "Pop Out"}, "19": {"artist": "Ariana Grande", "rank": 19, "title": "7 Rings"}, "20": {"artist": "Marshmello & Bastille", "rank": 20, "title": "Happier"}, "21": {"artist": "Blake Shelton", "rank": 21, "title": "God's Country"}, "22": {"artist": "Morgan Wallen", "rank": 22, "title": "Whiskey Glasses"}, "23": {"artist": "Panic! At The Disco", "rank": 23, "title": "High Hopes"}, "24": {"artist": "Panic! At The Disco", "rank": 24, "title": "Hey Look Ma, I Made It"}, "25": {"artist": "Lee Brice", "rank": 25, "title": "Rumor"}, "26": {"artist": "Young Thug, J. 
Cole & Travis Scott", "rank": 26, "title": "The London"}, "27": {"artist": "Daddy Yankee & Katy Perry Featuring Snow", "rank": 27, "title": "Con Calma"}, "28": {"artist": "Luke Combs", "rank": 28, "title": "Beer Never Broke My Heart"}, "29": {"artist": "Katy Perry", "rank": 29, "title": "Never Really Over"}, "30": {"artist": "J. Cole", "rank": 30, "title": "Middle Child"}, "31": {"artist": "benny blanco, Halsey & Khalid", "rank": 31, "title": "Eastside"}, "32": {"artist": "City Girls", "rank": 32, "title": "Act Up"}, "33": {"artist": "Mustard & Migos", "rank": 33, "title": "Pure Water"}, "34": {"artist": "Meek Mill Featuring Drake", "rank": 34, "title": "Going Bad"}, "35": {"artist": "Drake", "rank": 35, "title": "Omerta"}, "36": {"artist": "Tyler, The Creator", "rank": 36, "title": "Earfquake"}, "37": {"artist": "Thomas Rhett", "rank": 37, "title": "Look What God Gave Her"}, "38": {"artist": "Khalid", "rank": 38, "title": "Better"}, "39": {"artist": "Lady Gaga & Bradley Cooper", "rank": 39, "title": "Shallow"}, "40": {"artist": "A Boogie Wit da Hoodie", "rank": 40, "title": "Look Back At It"}, "41": {"artist": "Ariana Grande", "rank": 41, "title": "Break Up With Your Girlfriend, I'm Bored"}, "42": {"artist": "Travis Scott", "rank": 42, "title": "Sicko Mode"}, "43": {"artist": "Dan + Shay", "rank": 43, "title": "Speechless"}, "44": {"artist": "Halsey", "rank": 44, "title": "Nightmare"}, "45": {"artist": "Billie Eilish", "rank": 45, "title": "When The Party's Over"}, "46": {"artist": "Ed Sheeran Featuring Chance The Rapper & PnB Rock", "rank": 46, "title": "Cross Me"}, "47": {"artist": "Calboy", "rank": 47, "title": "Envy Me"}, "48": {"artist": "Kane Brown", "rank": 48, "title": "Good As You"}, "49": {"artist": "YG, Tyga & Jon Z", "rank": 49, "title": "Go Loko"}, "50": {"artist": "Jonas Brothers", "rank": 50, "title": "Cool"}, "51": {"artist": "Blanco Brown", "rank": 51, "title": "The Git Up"}, "52": {"artist": "Lil Tecca", "rank": 52, "title": "Ran$om"}, "53": 
{"artist": "DJ Khaled Featuring SZA", "rank": 53, "title": "Just Us"}, "54": {"artist": "Lewis Capaldi", "rank": 54, "title": "Someone You Loved"}, "55": {"artist": "P!nk", "rank": 55, "title": "Walk Me Home"}, "56": {"artist": "YK Osiris", "rank": 56, "title": "Worth It"}, "57": {"artist": "Cardi B & Bruno Mars", "rank": 57, "title": "Please Me"}, "58": {"artist": "Offset Featuring Cardi B", "rank": 58, "title": "Clout"}, "59": {"artist": "Luke Bryan", "rank": 59, "title": "Knockin' Boots"}, "60": {"artist": "Cardi B", "rank": 60, "title": "Press"}, "61": {"artist": "Maren Morris", "rank": 61, "title": "GIRL"}, "62": {"artist": "5 Seconds Of Summer", "rank": 62, "title": "Easier"}, "63": {"artist": "Meek Mill Featuring Ella Mai", "rank": 63, "title": "24/7"}, "64": {"artist": "Summer Walker X Drake", "rank": 64, "title": "Girls Need Love"}, "65": {"artist": "Eric Church", "rank": 65, "title": "Some Of It"}, "66": {"artist": "Dan + Shay", "rank": 66, "title": "All To Myself"}, "67": {"artist": "NLE Choppa", "rank": 67, "title": "Shotta Flow"}, "68": {"artist": "Bad Bunny & Tainy", "rank": 68, "title": "Callaita"}, "69": {"artist": "Jason Aldean", "rank": 69, "title": "Rearview Town"}, "70": {"artist": "Kelsea Ballerini", "rank": 70, "title": "Miss Me More"}, "71": {"artist": "Brett Eldredge", "rank": 71, "title": "Love Someone"}, "72": {"artist": "Beyonce", "rank": 72, "title": "Before I Let Go"}, "73": {"artist": "Florida Georgia Line", "rank": 73, "title": "Talk You Out Of It"}, "74": {"artist": "DJ Khaled Featuring Cardi B & 21 Savage", "rank": 74, "title": "Wish Wish"}, "75": {"artist": "Dreamville Featuring JID, Bas, J. 
Cole, EARTHGANG & Young Nudy", "rank": 75, "title": "Down Bad"}, "76": {"artist": "Chase Rice", "rank": 76, "title": "Eyes On You"}, "77": {"artist": "Lunay, Daddy Yankee & Bad Bunny", "rank": 77, "title": "Soltera"}, "78": {"artist": "Lil Uzi Vert", "rank": 78, "title": "Sanguine Paradise"}, "79": {"artist": "Marshmello Featuring CHVRCHES", "rank": 79, "title": "Here With Me"}, "80": {"artist": "Joji", "rank": 80, "title": "Sanctuary"}, "81": {"artist": "Sech Featuring Darell", "rank": 81, "title": "Otro Trago"}, "82": {"artist": "The Chainsmokers & Bebe Rexha", "rank": 82, "title": "Call You Mine"}, "83": {"artist": "Chris Young", "rank": 83, "title": "Raised On Country"}, "84": {"artist": "SHAED", "rank": 84, "title": "Trampoline"}, "85": {"artist": "Eli Young Band", "rank": 85, "title": "Love Ain't"}, "86": {"artist": "Billie Eilish", "rank": 86, "title": "Ocean Eyes"}, "87": {"artist": "Yella Beezy, Gucci Mane & Quavo", "rank": 87, "title": "Bacc At It Again"}, "88": {"artist": "Pedro Capo X Farruko", "rank": 88, "title": "Calma"}, "89": {"artist": "Travis Scott", "rank": 89, "title": "Wake Up"}, "90": {"artist": "Bryce Vine Featuring YG", "rank": 90, "title": "La La Land"}, "91": {"artist": "Jonas Brothers", "rank": 91, "title": "Only Human"}, "92": {"artist": "Marshmello Featuring A Day To Remember", "rank": 92, "title": "Rescue Me"}, "93": {"artist": "Megan Thee Stallion", "rank": 93, "title": "Big Ole Freak"}, "94": {"artist": "Nicky Jam X Ozuna", "rank": 94, "title": "Te Robare"}, "95": {"artist": "NAV Featuring Meek Mill", "rank": 95, "title": "Tap"}, "96": {"artist": "Ozuna x Daddy Yankee x J Balvin x Farruko x Anuel AA", "rank": 96, "title": "Baila Baila Baila"}, "97": {"artist": "Ali Gatie", "rank": 97, "title": "It's You"}, "98": {"artist": "Juice WRLD", "rank": 98, "title": "Robbery"}, "99": {"artist": "Nipsey Hussle Featuring Roddy Ricch & Hit-Boy", "rank": 99, "title": "Racks In The Middle"}, "100": {"artist": "Justin Moore", "rank": 100, "title": 
"The Ones That Didn't Make It Back Home"}}
database_results = [(u'Old Town Road', u'Lil Nas X featuring Billy Ray Cyrus'), (u'Talk', u'Coldplay'), (u'Talk', u'Khalid'), (u'Sucker', u'Jonas Brothers'), (u"I Don't Care", u'Buck Owens'), (u"I Don't Care", u'Fallout Boy'), (u"I Don't Care", u'Justin Bieber'), (u'Sunflower (Spider-Man: Into the Spider-Verse)', u'Post Malone & Swae Lee'), (u'Dancing With A Stranger', u'Sam Smith'), (u"If I Can't Have You", u'Shawn Mendes'), (u'Sweet But Psycho', u'Ava Max'), (u'Without Me', u'Halsey'), (u'Happier', u'Ed Sheeran'), (u'Happier', u'Marshmello'), (u"God's Country", u'Blake Shelton'), (u'Whiskey Glasses', u'Morgan Wallen'), (u'High Hopes', u'Panic! At  the disco'), (u'Beer Never Broke My Heart', u'Luke Combs'), (u'Never Really Over', u'Katy Perry'), (u'Hey Look Ma, I Made It', u'Panic! At the Disco'), (u'Speechless', u'Dan + Shay'), (u'Speechless', u'Hanson'), (u'Shallow', u'Lady Gaga & Bradley Cooper'), (u'BETTER', u'GUNS N" ROSES'), (u'Better', u'Khalid'), (u'Rumor', u'Lee Brice'), (u'Look What God Gave Her', u'Thomas Rhett'), (u"when the party's over", u'Billie Eilish'), (u'Cool', u'Gwen Stefani'), (u'Cool', u'Jonas Brothers'), (u'Beautiful Crazy', u'Luke Combs'), (u'Good As You', u'Kane Brown'), (u'Love Someone', u'Brett Eldredge'), (u'Love Someone', u'Lucas Graham'), (u'Someone You Loved', u'Lewis Capaldi'), (u'Miss Me More', u'Kelsea Ballerini'), (u'Walk Me Home', u'Mandy Moore'), (u'Walk Me Home', u'P!nk'), (u'Girl', u'Destiny\xb4s Child'), (u'Girl', u'Maren Morris'), (u'Girl', u'The Beatles'), (u"Knockin' Boots", u'Luke Bryan'), (u'Rearview Town', u'jason Aldean'), (u'All To Myself', u'Dan + Shay'), (u'Eyes On You', u'Chase Rice'), (u'Some of It', u'Eric Church'), (u'Here With Me', u'Mercyme'), (u'Talk You Out of It', u'Florida Georgia Line'), (u"Love Ain't", u'Eli Young Band'), (u'Heaven', u'Bryan Adams'), (u'Heaven', u'Derek Miller'), (u'Heaven', u'Kane Brown'), (u'Heaven', u'Salvador'), (u'Heaven', u'State of Sound'), (u'Heaven', u'Three Doors Down'), 
(u'Heaven', u'Warrant'), (u'Call You Mine', u'Chainsmokers (Feat. Bebe Rexha)'), (u'Ocean Eyes', u'Billie Eilish'), (u'On My Way to You', u'Cody Johnson'), (u'On My Way To You', u'Mercy Me'), (u'Raised on Country', u'Chris Young')]

I have been able to format the results so that they are structured into separate dataframes.
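For reference, here is a minimal sketch of how the two structures above could be loaded into dataframes. It uses only three entries from each dataset, and the variable name `top_100` is an assumption (the name shown above contains a space, which is not valid Python). The names `df_t` and `df2` match the ones used in the rest of the question.

```python
import pandas as pd

# Abbreviated samples of the two datasets above (three entries each).
# `top_100` is an assumed variable name for the scraped chart.
top_100 = {
    "1": {"artist": "Lil Nas X Featuring Billy Ray Cyrus", "rank": 1, "title": "Old Town Road"},
    "2": {"artist": "Taylor Swift", "rank": 2, "title": "You Need To Calm Down"},
    "4": {"artist": "Khalid", "rank": 4, "title": "Talk"},
}
database_results = [
    ("Old Town Road", "Lil Nas X featuring Billy Ray Cyrus"),
    ("Talk", "Coldplay"),
    ("Talk", "Khalid"),
]

# Each inner dict becomes one row; the outer keys ("1", "2", ...) become the index.
df_t = pd.DataFrame.from_dict(top_100, orient="index")

# Each tuple is (title, artist), so the columns are named in that order.
df2 = pd.DataFrame(database_results, columns=["title", "artist"])
```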

Top 100 list:

(image: Top 100 List)

Database results for the titles searched from the Top 100 list:

(image: Database Results)

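What I want to do is check which values from the Top 100 list are not included in the database search results. The idea is to generate a list of songs that need to be purchased in order to put together a Top 100 playlist.

So far, I have been able to return the items that appear in both dataframes with the following: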

set(df_t['title']).intersection(set(df2['title']))

Which yields:

{'All To Myself',
 'Beer Never Broke My Heart',
 'Better',
 'Call You Mine',
 'Cool',
 'Dancing With A Stranger',
 'Eyes On You',
 "God's Country",
 'Good As You',
 'Happier',
 'Here With Me',
 'Hey Look Ma, I Made It',
 'High Hopes',
 "I Don't Care",
 "If I Can't Have You",
 "Knockin' Boots",
 'Look What God Gave Her',
 "Love Ain't",
 'Love Someone',
 'Miss Me More',
 'Never Really Over',
 'Ocean Eyes',
 'Old Town Road',
 'Rearview Town',
 'Rumor',
 'Shallow',
 'Someone You Loved',
 'Speechless',
 'Sucker',
 'Sweet But Psycho',
 'Talk',
 'Walk Me Home',
 'Whiskey Glasses',
 'Without Me'}

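However, this is the opposite of what I want (I want to know which values from the Top 100 list were not returned by the query), and it has the additional problem that it does not account for another artist having a song with the same title, which can lead to false positives. So I tried the following: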

set(df_t['artist'] + ': ' + df_t['title']).intersection(set(df2['artist']+ ': ' + df2['title']))

Which yields:

{'Ava Max: Sweet But Psycho',
 'Billie Eilish: Ocean Eyes',
 "Blake Shelton: God's Country",
 'Brett Eldredge: Love Someone',
 'Chase Rice: Eyes On You',
 'Dan + Shay: All To Myself',
 'Dan + Shay: Speechless',
 "Eli Young Band: Love Ain't",
 'Halsey: Without Me',
 'Jonas Brothers: Cool',
 'Jonas Brothers: Sucker',
 'Kane Brown: Good As You',
 'Katy Perry: Never Really Over',
 'Kelsea Ballerini: Miss Me More',
 'Khalid: Better',
 'Khalid: Talk',
 'Lady Gaga & Bradley Cooper: Shallow',
 'Lee Brice: Rumor',
 'Lewis Capaldi: Someone You Loved',
 "Luke Bryan: Knockin' Boots",
 'Luke Combs: Beer Never Broke My Heart',
 'Morgan Wallen: Whiskey Glasses',
 'P!nk: Walk Me Home',
 "Shawn Mendes: If I Can't Have You",
 'Thomas Rhett: Look What God Gave Her'}

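So it filters out some results by requiring the artist to match as well, but it does not handle slight differences such as:

Lil Nas X Featuring Billy Ray Cyrus

Lil Nas X featuring Billy Ray Cyrus

So, if there is a way to handle slight misspellings and case differences while returning the values that are not present in the other dataframe, please let me know.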
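One possible way to tolerate small spelling and case differences is the standard library's `difflib.SequenceMatcher`. The sketch below is only an illustration: the helper names `similarity` and `is_in_library`, the sample rows, and the 0.9 threshold are all assumptions, and the threshold would need tuning against real data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case and extra whitespace."""
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    return SequenceMatcher(None, a, b).ratio()

# Small (title, artist) samples taken from the datasets in the question.
chart = [
    ("Old Town Road", "Lil Nas X Featuring Billy Ray Cyrus"),
    ("High Hopes", "Panic! At The Disco"),
    ("You Need To Calm Down", "Taylor Swift"),
]
library = [
    ("Old Town Road", "Lil Nas X featuring Billy Ray Cyrus"),
    ("High Hopes", "Panic! At  the disco"),
]

THRESHOLD = 0.9  # assumed cutoff; lower values accept more typos

def is_in_library(title, artist):
    """True if some library row is close enough on both title and artist."""
    return any(
        similarity(title, lib_title) >= THRESHOLD
        and similarity(artist, lib_artist) >= THRESHOLD
        for lib_title, lib_artist in library
    )

# Chart entries with no sufficiently similar library row.
missing = [(t, a) for t, a in chart if not is_in_library(t, a)]
```

Note that this compares every chart row with every library row, which is fine for 100 songs but would be slow for large catalogs.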

-Update-

So, I got a bit further by trying the following:

set(df_t['artist'].str.lower() + ': ' + df_t['title'].str.lower()).symmetric_difference(set(df2['artist'].str.lower()+ ': ' + df2['title'].str.lower()))

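This does give me the differences, but it returns the differences in both directions between the two dataframes, whereas I only want to see which of the Top 100 results are missing.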
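For the one-directional case, Python's `set.difference` (rather than `symmetric_difference`) keeps only the keys from the first set that are absent from the second. A minimal sketch with a few sample rows, assuming both dataframes expose `artist` and `title` columns:

```python
import pandas as pd

# Small samples standing in for the chart (df_t) and the query results (df2).
df_t = pd.DataFrame({
    "artist": ["Khalid", "Taylor Swift", "Jonas Brothers"],
    "title": ["Talk", "You Need To Calm Down", "Sucker"],
})
df2 = pd.DataFrame({
    "title": ["Talk", "Talk", "Sucker"],
    "artist": ["Coldplay", "Khalid", "Jonas Brothers"],
})

# Build lowercase "artist: title" keys for each side.
chart_keys = set(df_t["artist"].str.lower() + ": " + df_t["title"].str.lower())
library_keys = set(df2["artist"].str.lower() + ": " + df2["title"].str.lower())

# difference() is one-directional: chart entries that the library lacks.
missing_from_library = chart_keys.difference(library_keys)

# symmetric_difference() also reports library-only rows such as "coldplay: talk".
both_directions = chart_keys.symmetric_difference(library_keys)
```

This still requires an exact match after lowercasing, so partial artist credits and typos are not handled.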

2 Answers:

Answer 0: (score: 0)

Apply str.lower to both columns:

set(df_t['title'].str.lower()).intersection(set(df2['title'].str.lower()))
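Lowercasing alone would not catch entries such as 'Panic! At  the disco', which has a doubled space in the query results. A slightly broader normalization might look like the sketch below; the helper name `normalize` is hypothetical.

```python
import pandas as pd

def normalize(series):
    """Lowercase, trim, and collapse runs of whitespace in a string Series."""
    return (
        series.str.lower()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )

# Artist spellings as they appear in the query results vs. the chart.
library_artists = pd.Series(["Panic! At  the disco", "jason Aldean", "Billie Eilish"])
chart_artists = pd.Series(["Panic! At The Disco", "Jason Aldean", "Billie Eilish"])

# Element-wise comparison after normalization.
same = normalize(library_artists).eq(normalize(chart_artists))
```

The normalized columns can then be fed to the same set operations used in the question.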

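Answer 1: (score: 0)

So there are a couple of different ways to handle what I wrote above, but here is what I was able to find. As mentioned above, sets can be used for the comparison, and case differences can be handled as follows: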

set(df_t['artist'].str.lower() + ': ' + df_t['title'].str.lower()).symmetric_difference(set(df2['WinMediaartist'].str.lower()+ ': ' + df2['WinMediatitle'].str.lower()))

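This returns the differences between the dataframes without tripping over capitalization, but it returns the differences in both directions. So if the query results contain a title called "Talk" by an artist other than the one on the Billboard 100, it shows that difference too and produces a kind of false positive. Since I only want to know which Top 100 songs are not in the database, this is not a workable solution.

So I decided to merge the two dataframes: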

df_compare = df_t.merge(df2, left_on='title', how='left', right_on='WinMediatitle')

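Because df_t (the Billboard dataframe) is on the left side of the merge, I can see which Billboard titles have no match in the df2 (query) dataframe. Now, since I merged on title only, I still end up with mismatches between title and artist: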

df_compare.head(10)

The differences are obvious, but I need to do a bit more work to see what is missing. At this point, I can do the following:

df_missing = df_compare[df_compare['WinMediatitle'].isnull()]
df_missing.drop(columns=["WinMediatitle","WinMediaartist"])
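As a variation on the merge above, pandas' `merge` accepts `indicator=True`, which adds a `_merge` column marking rows found only on the left side. The sketch below merges on lowercased title and artist keys together; the `title_key` and `artist_key` column names are assumptions introduced for illustration, and the data is a small sample.

```python
import pandas as pd

# Small samples using the column names from this answer.
df_t = pd.DataFrame({
    "rank": [2, 4, 6],
    "artist": ["Taylor Swift", "Khalid", "Jonas Brothers"],
    "title": ["You Need To Calm Down", "Talk", "Sucker"],
})
df2 = pd.DataFrame({
    "WinMediatitle": ["Talk", "Talk", "Sucker"],
    "WinMediaartist": ["Coldplay", "Khalid", "Jonas Brothers"],
})

# Add lowercase join keys to both sides (assumed helper columns).
left = df_t.assign(
    title_key=df_t["title"].str.lower(),
    artist_key=df_t["artist"].str.lower(),
)
right = df2.assign(
    title_key=df2["WinMediatitle"].str.lower(),
    artist_key=df2["WinMediaartist"].str.lower(),
)

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join.
merged = left.merge(right, on=["title_key", "artist_key"], how="left", indicator=True)

# Keep only chart rows that had no title+artist match in the library.
df_missing = merged.loc[merged["_merge"] == "left_only", ["rank", "artist", "title"]]
```

Because both keys must match exactly, this avoids the same-title false positives but reports partial artist credits as missing.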

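Which returns:

(image)

This is a reasonable approximation of which titles are missing from the library, but it cannot handle the following cases:

  1. The title matches a song by another artist, but that artist differs from the one on the Top 100 list. This creates a scenario where we assume we have the track, when it is merely a same-titled song by a different artist.
  2. I need something that allows a partial artist match, because Billboard results tend to list the full credit: say, Ed Sheeran & Justin Bieber on the chart vs. Justin Bieber in the library. If I force both artist and title to match exactly, I get NaN.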
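Both cases might be approached by merging on title first and then checking whether the library artist's words all appear in the chart's artist credit. The following is a rough sketch under that assumption; the helpers `tokens` and `artist_matches` are hypothetical, and the subset rule is a heuristic that will miss credits such as 'Chainsmokers (Feat. Bebe Rexha)'.

```python
import re
import pandas as pd

def tokens(name):
    """Split an artist credit into a set of lowercase word tokens."""
    return set(re.findall(r"\w+", name.lower()))

def artist_matches(chart_artist, library_artist):
    """True if every word of the library artist appears in the chart credit."""
    lib = tokens(library_artist)
    return bool(lib) and lib <= tokens(chart_artist)

# Small samples taken from the datasets in the question.
df_t = pd.DataFrame({
    "artist": ["Ed Sheeran & Justin Bieber", "Khalid", "Marshmello & Bastille"],
    "title": ["I Don't Care", "Talk", "Happier"],
})
df2 = pd.DataFrame({
    "WinMediatitle": ["I Don't Care", "I Don't Care", "Talk", "Happier"],
    "WinMediaartist": ["Buck Owens", "Justin Bieber", "Coldplay", "Ed Sheeran"],
})

# Pair each chart row with every library row that shares its title.
pairs = df_t.merge(df2, left_on="title", right_on="WinMediatitle", how="left")

# Unmatched titles yield NaN for the library artist, hence the isinstance guard.
pairs["artist_ok"] = [
    isinstance(lib, str) and artist_matches(chart, lib)
    for chart, lib in zip(pairs["artist"], pairs["WinMediaartist"])
]

# A chart song counts as found if any candidate row passes the artist check.
found = pairs.groupby(["artist", "title"])["artist_ok"].any()
missing = found[~found].index.tolist()
```

With this sample, "I Don't Care" is treated as found through the Justin Bieber row, while "Talk" and "Happier" are reported as missing because the library only holds versions by other artists.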

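These are just the scenarios I quickly spotted that the pandas comparison above does not cover, and I expect more to surface over time. I could probably figure out how to handle each of them, but I think finding a catalog number or unique identifier for each track would be an easier way to resolve them all at once. I was thinking of something like the ISRC, but even that would not account for factors such as live vs. non-live versions causing missed results. Suffice it to say, this is more complicated than I originally imagined.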