I'm an absolute beginner, but I've managed to put together a working script from some existing scripts and tutorials. There is just one thing I still want, and unfortunately I can't get it to work.
So far I fetch data from one site, e.g. "http://www.example.com/01536496/.../". Now I have a list (.csv or .txt) whose first column contains many other numbers (or, in the txt file, one number per line). I want to scrape the web data for every number in that list, i.e. "http://www.example.com/No_1/.../", "http://www.example.com/No_2/.../" and so on.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import datetime
my_url = 'http://www.example.com/104289633/.../'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
...
Update
For example, I have a numbers.txt containing: 05543486 3468169 36189994
Now I want to put each of these numbers into the URL...
Please, can someone help me? I would be very grateful.
Update
After trying Andersson's code...
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import datetime

# Get list of numbers
with open("numbers.txt") as f:
    content = f.read()
numbers = content.split()

# Timestamp for the output filename
current_datetime = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M")

# Handle each URL in a loop
for number in numbers:
    my_url = 'https://www.immobilienscout24.de/expose/%s#/' % number
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    print(my_url)
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.find_all("div", {"class":"grid-item padding-desk-right-xl desk-two-thirds lap-one-whole desk-column-left flex-item palm--flex__order--1 lap--flex__order--1"})

    # Note: opening the file with "w" inside the loop overwrites it on every iteration
    filename = "results_" + current_datetime + ".csv"
    f = open(filename, "w")
    headers = "titel##adresse##criteria##preis##energie##beschreibung##ausstattung##lage\n"
    f.write(headers)
    ...
    f.write(titel + "##" + adresse + "##" + criteria.replace(" ", "; ") + "##" + preis.replace(" ", "; ") + "##" + energie.replace(" ", "; ") + "##" + beschreibung.replace("\n", " ") + "##" + ausstattung.replace("\n", " ") + "##" + lage.replace("\n", " ") + "\n")
    f.close()
Answer 0 (score: 0)
You can create a function that runs the for loop and updates the URL on each iteration, passing the list of numbers as an argument. For example:
def scrape(numbers):
    for num in numbers:
        my_url = 'http://www.example.com/No_' + str(num) + '/.../'
        uClient = uReq(my_url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")

numbers_list = [1, 2, 3, 4, 5]
scrape(numbers_list)
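If the numbers live in a file, as in the question, the list can be built first and then passed in. A minimal sketch, assuming numbers.txt contains whitespace-separated numbers:

with open("numbers.txt") as f:
    numbers_list = f.read().split()

scrape(numbers_list)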
Answer 1 (score: 0)
Could you achieve this by adding a basic for loop that appends each number to the end of the URL? I'm not sure whether this is what you need.
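A minimal sketch of such a loop, assuming the imports from the question and that the numbers are already collected in a list:

numbers = ['05543486', '3468169', '36189994']
for number in numbers:
    my_url = 'http://www.example.com/' + number + '/.../'
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")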
Answer 3 (score: 0)
You can iterate over the lines of a file in various ways, but I think the cleanest is to use pandas. All you need to do is:
import pandas as pd
df = pd.read_csv("filename.csv")
# assuming that filename.csv's first line has a header called "Numbers"
# You can apply a function `func` to each element of the column via `map`
df['Numbers'].map(func)
Using pandas' map function, we can pass each value to a function that creates our URL.
# First of all, we define this function
def numberToUrl(number):
    # We can use python's `string.format()` to format a string
    return 'http://www.example.com/{}/.../'.format(number)

# Then we can pass this function to each value with `map`
# and assign the result to a new column
df['url'] = df['Numbers'].map(numberToUrl)
# We can print the first 5 elements via:
df.head()
As you can see, applying the function to each row is very simple. If you want to iterate over the rows, you can do it like this:
for index, url in df['url'].items():
    # Do your operations here
In your case, it would be something like this:
for index, url in df['url'].items():
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    # ...
I wouldn't recommend using urllib.request directly. Instead, you could use a library called requests.
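A minimal sketch of the same loop using requests together with the pandas column from above (assuming requests is installed):

import requests
from bs4 import BeautifulSoup

for index, url in df['url'].items():
    response = requests.get(url)
    page_soup = BeautifulSoup(response.text, "html.parser")
    # ...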