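I'm currently scraping information from a service that doesn't provide JSON. The goal is to take the HTML below (a small excerpt of the page) and extract only the imports that are currently in progress, then build a JSON array from them in PHP.
So, for example, from the first entry I want: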
House M.D. (Season 4) (Episodes 1 - 5)
0-S_419212
Theater
DVD
Jun. 02, 6:05 pm
In progress, 14 minutes left
The next entry does not have state="active", so it should be skipped.
Sample HTML:
<tr class="import_row" handle="3f0761be271334a-L1_257" selection_handle="0-S_419212" state="active" utcstart="1464912324">
<td valign="top" class="start_time" nowrap="">Jun. 02, 6:05 pm</td>
<td valign="top" class="title">House M.D. (Season 4) (Episodes 1 - 5)</td>
<td valign="top" class="reader">Theater</td>
<td valign="top" class="type">DVD</td>
<td valign="top" class="status">In progress, 14 minutes left</td>
<td valign="top" class="edit"></td>
</tr>
<tr class="import_row" handle="3f0761be271334a-L1_255" selection_handle="0-S_4c6be1" state="completed" utcstart="1464673067">
<td valign="top" class="start_time" nowrap="">May. 30, 11:37 pm</td>
<td valign="top" class="title"><a href="javascript:getDetails('0-S_4c6be1');">National Treasure 2: Book of Secrets (Feature)</a></td>
<td valign="top" class="reader">Theater</td>
<td valign="top" class="type">DVD</td>
<td valign="top" class="status">Completed in 26 minutes</td>
<td valign="top" class="edit"></td>
</tr>
This is the code I'm using to fetch the page in PHP:
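// Fetch the page, keeping cookies between requests.
$cookie_file = "cookie.txt";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://10.1.1.150/home/index.html');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIESESSION, true);      // ignore session cookies from earlier runs
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // read cookies from this file
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);  // write cookies back when the handle closes
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
$input_lines = curl_exec($ch);
if ($input_lines === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);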
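
For reference, here is a minimal sketch of the kind of extraction I'm after, assuming the real page matches the sample markup above. It uses PHP's built-in DOMDocument and DOMXPath; the function name activeImportsToJson and the output key names are my own placeholders, not anything the service defines.

```php
<?php
// Hypothetical helper: the name and the output keys are illustrative assumptions.
// Returns a JSON array describing every import row whose state is "active".
function activeImportsToJson(string $html): string
{
    // Collect parser complaints quietly instead of emitting PHP warnings;
    // scraped pages are often not perfectly valid HTML.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $items = [];

    // Only rows for imports that are currently running.
    foreach ($xpath->query('//tr[@class="import_row"][@state="active"]') as $row) {
        $item = [
            'selection_handle' => $row->getAttribute('selection_handle'),
            'utcstart'         => (int) $row->getAttribute('utcstart'),
        ];
        // Copy the text of each cell of interest, keyed by its class name.
        foreach (['title', 'reader', 'type', 'start_time', 'status'] as $class) {
            $cell = $xpath->query('./td[@class="' . $class . '"]', $row)->item(0);
            $item[$class] = $cell ? trim($cell->textContent) : null;
        }
        $items[] = $item;
    }

    return json_encode($items, JSON_PRETTY_PRINT);
}
```

With the fetch code above, this would be called as echo activeImportsToJson($input_lines); once curl_exec() is known to have returned a string rather than false. Matching on the state attribute rather than on the status text avoids depending on the wording of "In progress, ...".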