Question

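I'm very new to programming and have had to learn it to make part of my PhD project actually feasible. After getting my Python code to pull data from a website and write it to an Excel file, I'm a bit lost on the next part.

I have my Excel file and a secondary file that was provided to me. I'm trying to search between the two files (both have an 'Address' column) and, once a match is found, pull a label (under 'Category') from the provided file and enter it into myfile. Or, if it's easier, just write the results to a brand new Excel file.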

'myfile'

# | Address
1 | 21 Abbotsford Street Falkirk FK2 7NH
2 | Police Station Commissioner Street Bo'ness EH51 9AF
3 | 4 Riverview Terrace Bo'ness EH51 9ED

Outcome file

# | Address | Category
1 | 21 Abbotsford Street Falkirk FK2 7NH | A
2 | Police Station Commissioner Street Bo'ness EH51 9AF | B
3 | 4 Riverview Terrace Bo'ness EH51 9ED | A

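The problem I'm having is that the 'Address' data in the two files isn't in any particular order, so how do I take an address from myfile, search for it in the provided file, pull out the 'Category', and then combine the address/category into an outcome file (or even just add the 'Category' to myfile)?

I'm also very sorry if this is unclear; I've tried my best to word it properly. Thanks for any advice, or even a pointer to something I could read up on to help solve this. :)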

Answer 1

Simply run a left-join merge that matches on the exact value of the address:

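import pandas as pd

# The keyword is sheet_name; the older spelling sheetname was removed from pandas
df1 = pd.read_excel('myFile.xlsx', sheet_name=0)          # ASSUMING DATA IN FIRST SHEET
df2 = pd.read_excel('OtherFile.xlsx', sheet_name=0)       # ASSUMING DATA IN FIRST SHEET

# Left join: keep every row of df1 and attach the matching Category from df2
outcomedf = pd.merge(df1, df2[['Address', 'Category']], on='Address', how='left')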
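
To see why the row order doesn't matter, here is a minimal sketch that uses the three sample rows from the question as in-memory data frames instead of reading the Excel files. The names `my_df`, `provided_df` and `merged` are just illustrative, and the provided data is deliberately shuffled:

```python
import pandas as pd

# Sample rows copied from the question (stand-in for myfile)
my_df = pd.DataFrame({
    "Address": [
        "21 Abbotsford Street Falkirk FK2 7NH",
        "Police Station Commissioner Street Bo'ness EH51 9AF",
        "4 Riverview Terrace Bo'ness EH51 9ED",
    ]
})

# Stand-in for the provided file, listed in a different order on purpose
provided_df = pd.DataFrame({
    "Address": [
        "4 Riverview Terrace Bo'ness EH51 9ED",
        "21 Abbotsford Street Falkirk FK2 7NH",
        "Police Station Commissioner Street Bo'ness EH51 9AF",
    ],
    "Category": ["A", "A", "B"],
})

# The left join keeps every row of my_df, in my_df's order,
# and looks up Category by the Address value rather than by position
merged = pd.merge(my_df, provided_df[["Address", "Category"]],
                  on="Address", how="left")
print(merged)
```

Any address in `my_df` that has no exact match in `provided_df` would simply get an empty (NaN) Category. Writing the result out should then just be a `to_excel` call on `merged`, with whatever output filename you choose.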

Answer 2

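I'm not entirely sure whether you're trying to combine the two files into one and check for duplicate addresses, or whether you're trying to check if the addresses in myfile are contained in the second file, so I've tried to cover both below.

If you want to combine and compare the two files: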

import pandas as pd

# Reads myfile Excel file to new data frame
df1 = pd.read_excel("C:/folder/myfile.xlsx")
# Reads second file (a .csv in this example) to new data frame
df2 = pd.read_csv("C:/folder/secondfile.csv")

# Creates a third data frame containing df1 and df2
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df3 = pd.concat([df1, df2])
# Checks for duplicates and creates a new column labeling which are duplicates
df3["duplicate"] = df3.duplicated(subset="Address", keep=False)
# Removes all but the first duplicate in the Address column
df3 = df3.drop_duplicates(subset="Address")

# Writes df3 to Excel file
df3.to_excel("C:/folder/outcomefile.xlsx", index=False)

Note: this only works if both files have the same columns.
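
As a rough illustration of that caveat, the sketch below assumes the situation from the question, where only the second file has a Category column. The data frames are made-up stand-ins built in memory, not the real files:

```python
import pandas as pd

# Stand-in for myfile: addresses only
df1 = pd.DataFrame({"Address": ["21 Abbotsford Street Falkirk FK2 7NH",
                                "4 Riverview Terrace Bo'ness EH51 9ED"]})

# Stand-in for the second file: same addresses plus a Category column
df2 = pd.DataFrame({"Address": ["4 Riverview Terrace Bo'ness EH51 9ED",
                                "21 Abbotsford Street Falkirk FK2 7NH"],
                    "Category": ["A", "A"]})

# Stacking the frames leaves Category empty (NaN) for the rows that came from df1
combined = pd.concat([df1, df2])

# drop_duplicates keeps the first occurrence of each address, i.e. the df1 rows,
# so the Category values from df2 are thrown away
combined = combined.drop_duplicates(subset="Address")
print(combined)
```

So when the columns differ like this, the stacked result ends up with no categories at all, and a merge on Address is probably the better fit.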

If you want to check whether the addresses in myfile are duplicated in the second file, add a duplicate column to myfile and write it to a new Excel file:

import pandas as pd

# Reads myfile Excel file to new data frame
df1 = pd.read_excel("C:/folder/myfile.xlsx")
# Reads second file (a .csv in this example) to new data frame
df2 = pd.read_csv("C:/folder/secondfile.csv")

# Checks if addresses in myfile (df1) are duplicated in second file (df2)
# Then adds duplicate column to myfile
df1["duplicate"] = df1["Address"].isin(df2["Address"])

# Writes edited myfile data frame to new Excel file
df1.to_excel("C:/folder/outcomefile.xlsx", index=False)

Note: both only match exact addresses.
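
If the two files differ only in spacing or capitalisation, one possible workaround is to match on a cleaned-up copy of the address. This is only a sketch: the `address_key` helper and the `key` column are hypothetical names, the sample rows are made up to show the mismatch, and real address data may need more cleaning than this.

```python
import pandas as pd

def address_key(addresses: pd.Series) -> pd.Series:
    """Return a normalised copy of an address column, used only for matching."""
    return (
        addresses.str.strip()                          # drop leading/trailing spaces
        .str.replace(r"\s+", " ", regex=True)          # collapse repeated whitespace
        .str.lower()                                   # ignore capitalisation
    )

# Made-up stand-ins: same addresses, but with stray spaces and different case
df1 = pd.DataFrame({"Address": ["21 Abbotsford Street  Falkirk FK2 7NH ",
                                "4 Riverview Terrace Bo'ness EH51 9ED"]})
df2 = pd.DataFrame({"Address": ["21 abbotsford street falkirk fk2 7nh",
                                "4 Riverview Terrace Bo'ness EH51 9ED"],
                    "Category": ["A", "A"]})

# Build the helper key on both sides, merge on it, then drop it again
df1["key"] = address_key(df1["Address"])
df2["key"] = address_key(df2["Address"])
result = df1.merge(df2[["key", "Category"]], on="key", how="left").drop(columns="key")
print(result)
```

The original Address text from myfile is kept as it was; only the temporary key is normalised.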

Hope this helps, and let me know if you need anything different!

Python - Search 2 Excel sheets, then extract a data label
