I have the following problem. I am working with the 2009/2010 White House visitor records dataset, a CSV with these headers:
https://obamawhitehouse.archives.gov/files/disclosures/visitors/WhiteHouse-WAVES-Key-1209.txt
I want to extract the names of all visitors who visited in both 2009 and 2010.
I have this function, but it is too slow. Is there a conceptually faster way?
import pandas

# `data` is the visitor-log DataFrame loaded from the CSV.
def task3():
    # Keep only the columns needed for the comparison.
    culled_data = data[["NAMELAST", "NAMEFIRST", "TOA", "TOD"]]
    # Split visits by the year that appears in the TOA column.
    data9 = culled_data[culled_data["TOA"].str.contains("2009", na=False)]
    data10 = culled_data[culled_data["TOA"].str.contains("2010", na=False)]
    # Count visits per (last name, first name) pair.
    unique_names = pandas.DataFrame(
        {"count": data.groupby(["NAMELAST", "NAMEFIRST"]).size()}
    ).reset_index()
    # Only names with more than one visit can appear in both years.
    unique_names = unique_names[unique_names["count"] > 1]
    count = 0
    # Slow part: every iteration rescans data9 and data10 four times.
    for index, row in unique_names.iterrows():
        if (data9[data9.NAMELAST == row["NAMELAST"]].shape[0] > 0
                and data10[data10.NAMELAST == row["NAMELAST"]].shape[0] > 0
                and data9[data9.NAMEFIRST == row["NAMEFIRST"]].shape[0] > 0
                and data10[data10.NAMEFIRST == row["NAMEFIRST"]].shape[0] > 0):
            count += 1
        else:
            unique_names = unique_names[unique_names.NAMELAST != row["NAMELAST"]]
    return count, unique_names
Answer 0 (score: 0)
One approach is to use Python sets:
# dropna() skips rows with a missing name, which ' '.join cannot handle
fullnames9 = {' '.join(r) for r in data9[['NAMEFIRST', 'NAMELAST']].dropna().values}
fullnames10 = {' '.join(r) for r in data10[['NAMEFIRST', 'NAMELAST']].dropna().values}
names_who_visited_in_both_years = fullnames9 & fullnames10  # set intersection
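Note that if two different people share the same first and last name, this code will incorrectly conclude that they visited in both years. Also, this only gives you the full names. Getting the DataFrame indices of the people who visited in both years would be more useful, and is left as an exercise ;)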
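For the index-based version mentioned above, here is a minimal sketch that stays inside pandas. It assumes `data` has the `NAMELAST`, `NAMEFIRST` and `TOA` columns from the question and that the year shows up in `TOA` as a four-digit string, which is the same assumption the `str.contains` filters make. The function name `visitors_in_both_years` and the sample rows are made up for illustration. The namesake caveat still applies: two people with identical names are treated as one.

```python
import pandas as pd

def visitors_in_both_years(data):
    """Return the rows whose (NAMELAST, NAMEFIRST) pair has visits in both 2009 and 2010."""
    # Pull the year out of the arrival string; rows with neither year become NaN.
    year = data["TOA"].str.extract(r"(2009|2010)", expand=False)
    # Drop rows that lack a name or a recognised year.
    tagged = data.assign(YEAR=year).dropna(subset=["NAMELAST", "NAMEFIRST", "YEAR"])
    # For every row, count how many distinct years its name pair appears in.
    years_per_name = tagged.groupby(["NAMELAST", "NAMEFIRST"])["YEAR"].transform("nunique")
    # Keep only rows belonging to names seen in both years.
    return tagged[years_per_name == 2]

# Hypothetical sample data, not taken from the real dataset.
data = pd.DataFrame({
    "NAMELAST": ["DOE", "DOE", "ROE", "ROE", "POE", None],
    "NAMEFIRST": ["JANE", "JANE", "RICHARD", "RICHARD", "EDGAR", "X"],
    "TOA": ["12/1/2009 10:00", "1/5/2010 9:30", "11/3/2009 14:00",
            "12/9/2009 8:00", "2/2/2010 11:15", "3/3/2010 12:00"],
})

both = visitors_in_both_years(data)
# Original row labels of the matching visits.
matching_index = both.index.tolist()
# One row per distinct person (by name).
names = both[["NAMELAST", "NAMEFIRST"]].drop_duplicates()
```

Because the grouping is done once for the whole frame, this avoids rescanning `data9` and `data10` for every name, which is what makes the loop in the question slow.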