Python 3.6: get text from a list

Time: 2018-05-18 12:37:22

Tags: python web-scraping

I am an absolute beginner, but using some existing scripts and tutorials I have managed to put together a working script. There is just one thing I still want, and unfortunately I haven't been able to do it.

So far I fetch the data from one site, e.g. "http://www.example.com/01536496/.../". Now I have a list (.csv or .txt) whose first column contains many other numbers (or, in a txt file, one number per line). Now I want to scrape the web data for all the numbers in the list, i.e. "http://www.example.com/No_1/.../", "http://www.example.com/No_2/.../" and so on.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import datetime

my_url = 'http://www.example.com/104289633/.../'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

...

Update

For example, I have a numbers.txt: 05543486 3468169 36189994

Now I want to put each of these numbers into the URL...
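
Roughly, I imagine something like this (just a sketch; the "..." stands for the rest of the real path):

with open("numbers.txt") as f:
    numbers = f.read().split()  # split on whitespace, one entry per number

for number in numbers:
    my_url = 'http://www.example.com/%s/.../' % number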

Please, can someone help me? I would be very grateful.

Update

After trying Andersson's code...

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import datetime

# Get list of numbers
with open("numbers.txt") as f:
    content = f.read()
    numbers = content.split()

# Open the output file once, before the loop, so each result row is
# appended instead of the file being recreated on every iteration
current_datetime = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")  # timestamp for the file name (format is just an example)
filename = "results_" + current_datetime + ".csv"
f = open(filename, "w")

headers = "titel##adresse##criteria##preis##energie##beschreibung##ausstattung##lage\n"
f.write(headers)

# Handle each URL in a loop; everything below must stay indented inside
# the loop body, otherwise only the last URL is fetched
for number in numbers:
    my_url = 'https://www.immobilienscout24.de/expose/%s#/' % number

    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    print(my_url)

    page_soup = soup(page_html, "html.parser")

    containers = page_soup.find_all("div", {"class": "grid-item padding-desk-right-xl desk-two-thirds lap-one-whole desk-column-left flex-item palm--flex__order--1 lap--flex__order--1"})

    ...

    f.write(titel + "##" + adresse + "##" + criteria.replace("    ", "; ") + "##" + preis.replace("    ", "; ") + "##" + energie.replace("    ", "; ") + "##" + beschreibung.replace("\n", " ") + "##" + ausstattung.replace("\n", " ") + "##" + lage.replace("\n", " ") + "\n")

f.close()

4 answers:

Answer 0 (score: 0)

You can create a function that runs a for loop and updates the url on every iteration. As a parameter you can pass the list of numbers. For example:

def scrape(numbers):
    for num in numbers:
        my_url = 'http://www.example.com/No_' + str(num) + '/.../'

        uClient = uReq(my_url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")


numbers_list = [1, 2, 3, 4, 5]
scrape(numbers_list)

Answer 1 (score: 0)

Could you achieve this by adding a basic for loop that appends each number to the end of the URL? I'm not sure if this is what you need.

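A rough sketch of that idea (assuming the numbers and the example.com URL pattern from the question; the "..." stands for the rest of the real path):

numbers = ['05543486', '3468169', '36189994']  # the example numbers from the question

for number in numbers:
    # append each number to the base URL
    my_url = 'http://www.example.com/' + number + '/.../'
    print(my_url)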

Answer 2 (score: 0)

You can use the following code:

{{1}}

Answer 3 (score: 0)

Loading from a csv file

You can iterate over the file's lines in various ways, but I think the cleanest is to use pandas. You just have to do this:

import pandas as pd
df = pd.read_csv("filename.csv")

# assuming that filename.csv's first line has a header called "Numbers"
# You can apply a function `func` to each element of the column via `map`
df['Numbers'].map(func)

URLs from the numbers

Using pandas' map function, we can pass each value to a function to create our URLs.

# First of all, we define this function
def numberToUrl(number):
    # We can use Python's `str.format()` to format a string
    return 'http://www.example.com/{}/.../'.format(number)

# Then we can pass this function to each value with `map`
# and assign the result to a new column
df['url'] = df['Numbers'].map(numberToUrl)

# We can print the first 5 elements via:
df.head()

As you can see, applying a function to each row is very simple. If you want to iterate over the rows, you can do it like this:

for (index, row) in df['url'].iteritems():
    # Do your operations here

In your case, it would be something like this:

for (index, row) in df['url'].iteritems():
    uClient = uReq(row)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")
    # ...

Additional notes

I would not recommend using urllib.request directly. Instead, you can use a wrapper library called requests.
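
For example, the fetch from the snippets above might look like this with requests (a minimal sketch; requests is a third-party package, so it has to be installed first, e.g. with pip install requests):

import requests
from bs4 import BeautifulSoup

response = requests.get(my_url)
response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
page_soup = BeautifulSoup(response.text, "html.parser")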