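I am currently converting all of my Beautiful Soup code to PHP in order to get used to PHP. However, I have run into a problem: my PHP code only works when the wiki page has an "External links" section after "Original run" in the HTML (for example, the True Detective wiki page). I have just realised that this will not always be the case, because a page may not have an "External links" section at all. Is there a way to convert my Beautiful Soup code to PHP using the same technique that the Beautiful Soup code uses?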
import re
import requests
from bs4 import BeautifulSoup

def get_date(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # Every table whose class list includes "infobox"
    date = soup.find_all("table", {"class": "infobox"})
    for item in date:
        dates = item.find_all("th")
        for item2 in dates:
            if item2.text == "Original run":
                # The first <td> after the matching header holds the dates
                test2 = item2.find_next("td").text
                # Remove parenthesised notes such as "(2010-10-31)"
                mysub = re.sub(r'\([^)]*\)', '', test2)
                return mysub
This is my current PHP code:
<?php
// Defining the basic cURL function
function curl($url) {
    $ch = curl_init(); // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
?>
<?php
// Defining the basic scraping function
function scrape_between($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
?>
<?php
$scraped_page = curl("http://en.wikipedia.org/wiki/The_Walking_Dead_(TV_series)"); // Downloading the Wikipedia page to variable $scraped_page
$scraped_data = scrape_between($scraped_page, "<table class=\"infobox vevent\" style=\"width:22em\">", "</table>"); // Scraping downloaded data in $scraped_page for content between the infobox <table ...> opening tag and the next </table>
$original_run = mb_substr($scraped_data, strpos($scraped_data, "Original run")-2, strpos($scraped_data, "External links") - strpos($scraped_data, "Original run")-2);
echo $original_run;
?>
Answer 0 (score: 1)
Have you considered using the Wikipedia API? The automatically generated wiki markup is usually very painful to work with and may change at any time.
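As a rough illustration of the API route, the sketch below builds a request URL for the MediaWiki action API (`action=parse` with `prop=wikitext`) and reads the wikitext out of the JSON reply. The function names `wikitext_api_url` and `wikitext_from_response` are hypothetical helpers, and the response shape assumed here is the one returned with `formatversion=2`; treat it as a starting point rather than a finished client.

```php
<?php
// Hypothetical helper: build the URL that asks the MediaWiki action API
// for the raw wikitext of one page.
function wikitext_api_url(string $title): string {
    $params = [
        'action'        => 'parse',
        'page'          => $title,
        'prop'          => 'wikitext',
        'format'        => 'json',
        'formatversion' => 2,
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params, '', '&');
}

// Hypothetical helper: extract the wikitext from the JSON body.
// Assumes the formatversion=2 layout: {"parse": {"wikitext": "..."}}.
// Returns null when the body is not JSON or carries no wikitext.
function wikitext_from_response(string $json): ?string {
    $data = json_decode($json, true);
    return $data['parse']['wikitext'] ?? null;
}

echo wikitext_api_url('The_Walking_Dead_(TV_series)'), "\n";
```

The URL can be fetched with the `curl()` function from the question and the body passed to `wikitext_from_response()`. The air dates then have to be read from the infobox template parameters in the wikitext, whose names depend on the template the page uses.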
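Also, rather than trying to parse the HTML with regular expressions or similar, you can use the phpQuery library (installable with Composer) and simply search for the selector table.infobox.vevent.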
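The same idea can also be sketched without phpQuery, using PHP's built-in DOM extension (`DOMDocument` and `DOMXPath`) in its place. The function below follows the Beautiful Soup logic from the question: find the infobox table, look for the header cell with the given text, take the first `<td>` after it, and strip parenthesised notes. The name `get_infobox_value` and the inline sample HTML are assumptions made for illustration, not real Wikipedia output.

```php
<?php
// Hypothetical helper mirroring find_all("th") + find_next("td").
function get_infobox_value(string $html, string $label): ?string {
    $doc = new DOMDocument();
    // Real pages are rarely clean markup, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Prefixing an XML encoding declaration makes loadHTML read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Header cells inside any table whose class list contains "infobox".
    $query = '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]//th';
    foreach ($xpath->query($query) as $th) {
        if (trim($th->textContent) !== $label) {
            continue;
        }
        // First <td> after this <th> in document order.
        $td = $xpath->query('following::td[1]', $th)->item(0);
        if ($td === null) {
            return null;
        }
        // Remove parenthesised notes, as the re.sub call in the Python code does.
        return trim(preg_replace('/\([^)]*\)/', '', $td->textContent));
    }
    return null; // No matching header found.
}

// Made-up sample markup, shaped like an infobox row.
$sample = '<table class="infobox vevent"><tr><th>Original run</th>'
        . '<td>October 31, 2010 (2010-10-31) - present</td></tr></table>';
echo get_infobox_value($sample, 'Original run'), "\n";
```

Because the lookup is anchored on the header cell rather than on the text that happens to follow the table, it does not depend on an "External links" section being present. For a live page, the string returned by the question's `curl()` function would be passed in as `$html`.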