Question

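I'm writing the Python code below to merge two tables. This can be done with VLOOKUP in Excel, but I want to automate the process for larger datasets. However, the output seems too large and contains all the columns from both tables. I only want to use the second table, df_pos, to look up a few columns. Could you take a look and tell me whether my code is an efficient or workable way to do this?

Thanks!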

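import pandas as pd

def weighted(mwa="mwa.csv", mwa2="mwa.csv", output="WeightedMWA.csv"):
    df = pd.read_csv(mwa, thousands=",")
    # .str.replace strips the literal '+' inside each keyword;
    # Series.replace would only match cells that are exactly '+'
    df['Keyword'] = df['Keyword'].str.replace('+', '', regex=False)
    df_pos = pd.read_csv(mwa2, thousands=",")
    df_pos['Keyword'] = df_pos['Keyword'].str.replace('+', '', regex=False)
    sumImp = df_pos['Impr.'].sum()
    sumPos = df_pos.groupby(by=['Keyword'])['Avg. Pos.'].sum()
    # sumPos is indexed by keyword, so map it back onto the rows by keyword
    df_pos['WeightedPos'] = df_pos['Keyword'].map(sumPos / sumImp)
    mergedDF = pd.merge(left=df, right=df_pos, how="left", left_on="Keyword", right_on="Keyword")
    mergedDF.to_csv(output)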

Answer 1

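You haven't given us enough information. You are writing out the merged DataFrame, but you haven't said which columns are required in the output. Ideally, you would keep only the columns you need in the output plus the columns needed for the merge.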
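
As a minimal sketch of that idea: the question does not say which columns are wanted, so this assumes only WeightedPos is needed from df_pos, and the sample data is made up. The right-hand frame is trimmed to the merge key plus the looked-up column before merging:

```python
import pandas as pd

# Made-up sample data; column names follow the question.
df = pd.DataFrame({"Keyword": ["shoes", "boots", "hats"], "Clicks": [10, 5, 2]})
df_pos = pd.DataFrame({
    "Keyword": ["shoes", "boots"],
    "Impr.": [100, 50],
    "Avg. Pos.": [1.5, 2.0],
    "WeightedPos": [0.01, 0.0133],
})

# Keep only the merge key plus the looked-up column(s).
# drop_duplicates guards against the left join multiplying rows
# when a keyword appears more than once in the lookup table.
lookup = df_pos[["Keyword", "WeightedPos"]].drop_duplicates(subset="Keyword")

merged = df.merge(lookup, how="left", on="Keyword")
print(merged.columns.tolist())
```

Keywords missing from the lookup table ("hats" here) end up with NaN in WeightedPos, which is roughly what a failed VLOOKUP would give you.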

You can limit the columns that are imported by using the usecols parameter of the read_csv function. The documentation says:

usecols : array-like, default None
    Return a subset of the columns. All elements in this array must either
    be positional (i.e. integer indices into the document columns) or strings
    that correspond to column names provided either by the user in `names` or
    inferred from the document header row(s). For example, a valid `usecols`
    parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
    results in much faster parsing time and lower memory usage.
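
A short sketch of usecols in use. The CSV text below is made up and stands in for mwa.csv; which columns to keep is an assumption, since the question does not list them:

```python
import io
import pandas as pd

# Made-up CSV standing in for mwa.csv; column names follow the question.
csv_text = 'Keyword,Impr.,Avg. Pos.,Clicks\n+shoes,"1,200",1.5,10\n+boots,300,2.0,5\n'

# Only the listed columns are parsed; 'Clicks' is never loaded.
df_pos = pd.read_csv(
    io.StringIO(csv_text),
    thousands=",",
    usecols=["Keyword", "Impr.", "Avg. Pos."],
)
print(df_pos.columns.tolist())
```

With a real file you would pass the path instead of the io.StringIO object.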

Answer 2

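If you are only using df_pos to look up data from another matrix, just use the field in df_pos as an index into the frame you are looking the data up in, i.e. datasourcematrix[df_pos.LOOKUPCOLUMNNAME], or, if you don't have column names, you could do datasourcematrix[df_pos.ix[5]] or something similar (.ix has since been removed from pandas; .iloc is the positional replacement). Easier, and faster...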
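
One possible reading of this suggestion, sketched with made-up data: put the lookup values in a frame indexed by keyword and pull them out by key. The name datasource and its contents are hypothetical, and the sketch assumes the keyword index is unique:

```python
import pandas as pd

# Made-up data; 'datasource' plays the role of datasourcematrix above.
datasource = pd.DataFrame(
    {"WeightedPos": [0.010, 0.013]},
    index=pd.Index(["shoes", "boots"], name="Keyword"),
)
df = pd.DataFrame({"Keyword": ["boots", "shoes", "hats"]})

# map looks each keyword up in the indexed column; keys that are
# not in the index come back as NaN, much like a failed VLOOKUP.
df["WeightedPos"] = df["Keyword"].map(datasource["WeightedPos"])
print(df)
```

This avoids a merge entirely, so no extra columns are brought along.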

python pandas merge / vlookup tables
