Using files

Time: 2016-08-10 18:01:46

Tags: python multithreading url asynchronous http-post

I'm fairly new to Python in general. I don't know much about anything yet, but I decided to try automating some data processing asynchronously. I found out about aiohttp and everything went fine: the POST requests were made asynchronously, and checking the server I could see the input coming in.

My problem comes when I try to use a bigger file, with 300,000 lines, each line producing one POST request (in reality there is one file with 50,000 lines and another with 6; combined, each request carries data1=datafromfile1&data2=datafromfile2).
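
In other words, every line of the 50,000-line file gets paired with every line of the 6-line file, a cross product; roughly (the list names here are just assumed for illustration):

import itertools

# 50,000 lines x 6 lines = 300,000 request payloads in total
pairs = itertools.product(lines_from_file1, lines_from_file2)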

I tried using threads with a queue to read the files while posting the requests, but that didn't work out for me. Anyway, here is my code:

import random
import asyncio
import re
import aiohttp
from aiohttp import ClientSession
import queue

headers = [ "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/601.6.17 (KHTML, like Gecko) Version/9.1.1 Safari/601.6.17",
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
]


port = input("Port: ")
thread = input("Thread: ")

async def fetch(url, data1, data2):
    # Build the target URL with both payload values as query parameters.
    url = "https://" + url + ":%s" % str(port) + "/?&d1=" + data1 + "&d2=" + data2 + "&data3=testing"
    header = {"User-Agent": random.choice(headers)}
    async with ClientSession(connector=aiohttp.TCPConnector(verify_ssl=False)) as session:
        try:
            # with aiohttp.Timeout(10):
            async with session.post(url, headers=header) as response:
                status = response.status
                if status == 200:
                    print("POSTED")
                else:
                    print("WRONG [%d]" % status)
                return status
        except Exception as e:
            print("ERROR [%s]" % type(e).__name__)
            return 0


async def bound_fetch(sem, url, data1, data2):
    # Gate each request on the semaphore so at most `thread` run concurrently.
    async with sem:
        await fetch(url, data1, data2)


async def run(tuples):

    tasks = []
    # Create one Semaphore shared by every request in the batch.
    sem = asyncio.Semaphore(int(thread))
    for (url, data1, data2) in tuples:
        # Pass the semaphore to every POST request.
        task = asyncio.ensure_future(bound_fetch(sem, url, data1, data2))
        tasks.append(task)

    # gather() waits until every task in the batch has finished.
    await asyncio.gather(*tasks)
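
A side note on the snippet above: fetch() opens a brand-new ClientSession for every request, while the aiohttp docs recommend sharing one session so its connection pool gets reused. A minimal sketch of that pattern (the parameter names are mine, not from the code above):

import asyncio
import aiohttp

async def fetch(session, url, data1, data2):
    # Reuse the shared session; only the request itself happens per call.
    async with session.post(url, params={"d1": data1, "d2": data2}) as response:
        return response.status

async def run(tuples, concurrency):
    sem = asyncio.Semaphore(concurrency)
    # One ClientSession (and one connection pool) for the whole run.
    async with aiohttp.ClientSession(
            connector=aiohttp.TCPConnector(verify_ssl=False)) as session:
        async def bounded(url, d1, d2):
            async with sem:
                return await fetch(session, url, d1, d2)
        return await asyncio.gather(*(bounded(u, a, b) for (u, a, b) in tuples))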

This is where I read the data and process it. I build a tuple with the url and the POST data, in case I decide to change post.php or add more fields later.

The rest of the code:

def main():
    url = "example.com"  # assumed placeholder: the original snippet never defines url
    loop = asyncio.get_event_loop()
    with open("data1.txt") as log_file:
        for line in log_file:
            data1 = line.strip('\n')  # one payload value per line of data1.txt
            tuples = []
            with open("data2.txt") as data_file:
                for data2 in data_file:
                    data2 = data2.strip('\n')
                    tuples.append((url, data1, data2))
            future = asyncio.ensure_future(run(tuples))
            # why doesn't it stop after the first batch of 6 links is done
            # (one value from data1.txt times six from data2.txt)?
            loop.run_until_complete(future)
            print("Done a batch.")
    loop.close()

main()
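
For comparison, here is a variant of main() where data2.txt is read only once and each batch is awaited in turn; this is just a sketch, and url stays a placeholder since the snippet above never defines it:

def batched_main():
    url = "example.com"  # placeholder host, not from the original code
    with open("data2.txt") as data_file:
        # Read the six data2 values once instead of reopening the file per line.
        data2_values = [line.strip('\n') for line in data_file]
    loop = asyncio.get_event_loop()
    with open("data1.txt") as log_file:
        for line in log_file:
            data1 = line.strip('\n')
            batch = [(url, data1, d2) for d2 in data2_values]
            # run_until_complete blocks until the whole batch has finished.
            loop.run_until_complete(run(batch))
            print("Done a batch.")
    loop.close()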

I'm pretty sure I don't understand the aiohttp documentation. Is there some function like thread.join() from the threading module, so that reading pauses until run_until_complete has actually finished its work and I really get a response back? I don't really understand the Future object.
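
My current understanding, which may well be wrong: loop.run_until_complete() already blocks the calling thread until the future resolves, which sounds like the asyncio counterpart of thread.join():

loop = asyncio.get_event_loop()
# Blocks right here, like thread.join(), until run(tuples) has finished;
# nothing below this line executes while requests are still in flight.
loop.run_until_complete(run(tuples))
print("batch finished")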

Also, on the Debian Linux VM where I test the script, I get an OSError after it has been running for a while. It says the name or service is not known, and even when the script otherwise works, some requests still get this. Any suggestions? Thanks!
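
My own unverified guess: "Name or service not known" is a DNS resolution failure, and opening very many connections at once can exhaust sockets or the resolver. aiohttp's TCPConnector takes a limit argument that caps open connections; a sketch (the value 50 is arbitrary):

import aiohttp

async def post_all(urls):
    # Cap simultaneous connections so sockets and DNS lookups aren't exhausted.
    connector = aiohttp.TCPConnector(limit=50, verify_ssl=False)
    async with aiohttp.ClientSession(connector=connector) as session:
        for url in urls:
            async with session.post(url) as response:
                print(response.status)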

0 Answers
