How do I append output at a specific position in Python?

Time: 2019-08-15 20:17:03

Tags: python pandas dataframe output valueerror

I have a DataFrame:

 df
    Out[20]: 
                StreetAddressLine1 StateAbbreviation  ... Longitude BetId
    0                      Unknown           Unknown  ...      None     0
    1         21236 Birchwood Loop                AK  ...      None     1
    2               1731 Bragaw St                AK  ...      None     2
    3               4360 Snider Dr                AK  ...      None     4
    4             9750 W Parks Hwy                AK  ...      None    10
    5            7205 Shorewood Dr                AK  ...      None    11
    6             326 Woodside Ave                AK  ...      None    14
    7  2036 E Northern Lights Blvd                AK  ...      None    15
    8              1600 E Tudor Rd                AK  ...      None    16
    9        1130, 2545 E Tudor Rd                AK  ...      None    17

I run the following code to geocode these addresses:

import time

import geocoder
import pandas as pd
import requests

input_file_path = "df"
output_file_path = "output"  # appends "####.csv" to the file name when it writes the file.

# Set the name of the column indexes here so that pandas can read the CSV file
address_column_name = "StreetAddressLine1"
state_column_name = "StateAbbreviation"
zip_column_name = "ZipCode"   # Leave blank("") if you do not have zip codes

# Where the program starts processing the addresses in the input file
# This is useful in case the computer crashes so you can resume the program where it left off or so you can run multiple
# instances of the program starting at different spots in the input file
start_index = 0
# How often the program prints the status of the running program
status_rate = 100
# How often the program saves a backup file
write_data_rate = 1000
# How many times the program tries to geocode an address before it gives up
attempts_to_geocode = 3
# Time it delays each time it does not find an address
# Note that this is added to itself each time it fails so it should not be set to a large number
wait_time = 3

# ----------------------------- Processing the input file -----------------------------#

#df = pd.read_csv(input_file_path, low_memory=False,encoding="utf-8")
# df = pd.read_excel(input_file_path)

# Raise errors if the provided column names could not be found in the input file
if address_column_name not in df.columns:
    raise ValueError("Can't find the address column in the input file.")
if state_column_name not in df.columns:
    raise ValueError("Can't find the state column in the input file.")

# Zip code is not needed but helps provide more accurate locations
if (zip_column_name):
    if zip_column_name not in df.columns:
        raise ValueError("Can't find the zip code column in the input file.")
    addresses = (df[address_column_name] + ', ' + df[zip_column_name].astype(str) + ', ' + df[state_column_name]).tolist()
else:
    addresses = (df[address_column_name] + ', ' + df[state_column_name]).tolist()


# ----------------------------- Function Definitions -----------------------------#

# Creates request sessions for geocoding
class GeoSessions:
    def __init__(self):
        self.Arcgis = requests.Session()
        self.Komoot = requests.Session()


# Function that returns a fresh set of sessions, one per geocoding source
def create_sessions():
    return GeoSessions()


# Main geocoding function that uses the geocoder package to convert addresses into lat, longs
def geocode_address(address, s):
    g = geocoder.arcgis(address, session=s.Arcgis)
    if (g.ok == False):
        g = geocoder.komoot(address, session=s.Komoot)

    return g


def try_address(address, s, attempts_remaining, wait_time):
    g = geocode_address(address, s)
    if (g.ok == False):
        time.sleep(wait_time)
        s = create_sessions()  # It is not very likely that we can't find an address so we create new sessions and wait
        if (attempts_remaining > 0):
            # Return the retry's result; without this the first failed lookup is always returned
            return try_address(address, s, attempts_remaining-1, wait_time+wait_time)
    return g


# Function used to write data to the output file
def write_data(data, index):
    file_name = (output_file_path + str(index) + ".csv")
    print("Created the file: " + file_name)
    done = pd.DataFrame(data)
    done.columns = ['Address', 'Lat', 'Long']
    done.to_csv(file_name, sep=',', encoding='utf8')  # file_name already ends in ".csv"


# Variables used in the main for loop that do not need to be modified 
s = create_sessions()
results = []
failed = 0
total_failed = 0
progress = len(addresses) - start_index

# ----------------------------- Main Loop -----------------------------#

for i, address in enumerate(addresses[start_index:]):
    # Print the status of how many addresses have been processed so far and how many of them failed.
    if ((start_index + i) % status_rate == 0):
        total_failed += failed
        print(
            "Completed {} of {}. Failed {} for this section and {} in total.".format(i + start_index, progress, failed,
                                                                                     total_failed))
        failed = 0

    # Try geocoding the addresses
    try:
        g = try_address(address, s, attempts_to_geocode, wait_time)
        if (g.ok == False):
            results.append([address, "was", "not", "geocoded"])
            print("Gave up on address: " + address)
            failed += 1
        else:
            results.append([address, g.latlng[0], g.latlng[1]])

    # If we failed with an error like a timeout we will try the address again after we wait 5 secs
    except Exception as e:
        print("Failed with error {} on address {}. Will try again.".format(e, address))
        try:
            time.sleep(5)
            s = create_sessions()
            g = geocode_address(address, s)
            if (g.ok == False):
                print("Did not find it.")
                results.append([address, "was", "not", "geocoded"])
                failed += 1
            else:
                print("Successfully found it.")
                results.append([address, g.latlng[0], g.latlng[1]])
        except Exception as e:
            print("Failed with error {} on address {} again.".format(e, address))
            failed += 1
            results.append([address, e, e, "ERROR"])

    # Writing what has been processed so far to an output file
    if (i%write_data_rate == 0 and i != 0):
        write_data(results, i + start_index)

    # print(i, g.latlng, g.provider)


# Finished
write_data(results, i + start_index + 1)
print("Finished! :)")

I would like the results in the final output file to also include the BetId for each of these locations.

I tried

results.append(df["BetId"])

which resulted in

Out[40]: 
[['Unknown, Unknown, Unknown', 25.851060000000075, 88.24131000000006],
 ['21236 Birchwood Loop, 99567, AK', 61.408868754342635, -149.48655639165537],
 ['1731 Bragaw St, 99508, AK', 61.204894742714515, -149.80829304403093],
 ['4360 Snider Dr, 99654, AK', 61.58477348398676, -149.34070982806276],
 ['9750 W Parks Hwy, 99652, AK', 61.56803449899039, -149.69619047155058],
 ['7205 Shorewood Dr, 99645, AK', 61.626084461047675, -149.2686871507012],
 ['326 Woodside Ave, 99603, AK', 59.64314849321932, -151.55260136081137],
 ['2036 E Northern Lights Blvd, 99508, AK',
  61.19525951731225,
  -149.8425921931733],
 ['1600 E Tudor Rd, 99507, AK', 61.180762485140605, -149.851269],
 ['1130, 2545 E Tudor Rd, 99507, AK', 61.180918519331485, -149.83301780682672],
 0     0
 1     1
 2     2
 3     4
 4    10
 5    11
 6    14
 7    15
 8    16
 9    17
 Name: BetId, dtype: int64]

But as you can see, BetId is not appended as a 4th column after the lat and long.

I also tried

write_data(results.append(df["BetId"]), i + start_index + 1)

but I get an error:

  
    

ValueError: Length mismatch: Expected axis has 0 elements, new values have 3 elements

  

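How can I fix this so that the final output CSV includes the BetId from the original DataFrame in addition to the geocoding results?

1 Answer:

Answer 0: (score: 1)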

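Right now you are appending a pandas Series to a standard Python list, and those are two different kinds of objects: the whole Series becomes a single extra element at the end of the list. Since BetId and addresses have the same length, because both come from columns of the same DataFrame, use the enumerate loop variable i to index BetId, then add that value as the 4th element of each row's list. Do this while the results are being processed, not afterwards: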

for i, address in enumerate(addresses[start_index:]):
    ...
    # Look BetId up by position, offset by start_index so resumed runs stay aligned
    results.append([address, g.latlng[0], g.latlng[1], df["BetId"].iloc[start_index + i]])
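
Here is a minimal, self-contained sketch of the same idea. It assumes a toy three-row DataFrame and a hypothetical `fake_geocode` stand-in for the real geocoder call; neither comes from the original post. It also illustrates why the second attempt in the question raised the ValueError: `list.append` modifies the list in place and returns `None`, so `write_data` received `None` and built an empty DataFrame with 0 columns. Once every row carries four values, the column list assigned in `write_data` would need a fourth name as well, for example `'BetId'`.

```python
import pandas as pd

# Toy stand-in for the question's DataFrame (values are illustrative only).
df = pd.DataFrame({
    "StreetAddressLine1": ["1731 Bragaw St", "4360 Snider Dr", "9750 W Parks Hwy"],
    "StateAbbreviation": ["AK", "AK", "AK"],
    "BetId": [2, 4, 10],
})

addresses = (df["StreetAddressLine1"] + ", " + df["StateAbbreviation"]).tolist()
bet_ids = df["BetId"].tolist()
start_index = 0


def fake_geocode(address):
    """Hypothetical placeholder for the real geocoder call; returns a dummy (lat, lng)."""
    return (0.0, 0.0)


results = []
for i, address in enumerate(addresses[start_index:]):
    lat, lng = fake_geocode(address)
    # Positional lookup, offset by start_index so resumed runs stay aligned.
    results.append([address, lat, lng, bet_ids[start_index + i]])

# list.append mutates the list and returns None ...
append_return = [].append(1)
# ... and a DataFrame built from None has no columns, hence the
# "Expected axis has 0 elements" error when 3 column names were assigned.
empty = pd.DataFrame(append_return)

# Four values per row, so four column names.
done = pd.DataFrame(results, columns=["Address", "Lat", "Long", "BetId"])
print(done)
```

Rows for addresses that could not be geocoded would need the BetId appended in the same way, so that every row ends up with the same number of fields.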