我有一个巨大的HTML表(大约500,000行),我需要将其转换为JSON文件。 该表看起来像这样:
<table>
<tr>
<th>Id</th>
<th>Timestamp</th>
<th>Artist_Name</th>
<th>Tweet_Id</th>
<th>Created_at</th>
<th>Tweet</th>
<th>User_name</th>
<th>User_Id</th>
<th>Followers</th>
</tr>
<tr>
<td>1</td>
<td>2013-06-07 16:00:17</td>
<td>Kelly Rowland</td>
<td>343034567793442816</td>
<td>Fri Jun 07 15:59:48 +0000 2013</td>
<td>So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...</td>
<td>Nicole Barrett</td>
<td>33831594</td>
<td>62</td>
</tr>
<tr>
<td>2</td>
<td>2013-06-07 16:00:17</td>
<td>Kelly Rowland</td>
<td>343034476395368448</td>
<td>Fri Jun 07 15:59:27 +0000 2013</td>
<td>RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht…</td>
<td>A.J.</td>
<td>24193447</td>
<td>340</td>
</tr>
我想创建一个看起来像这样的JSON文件:
{'data': [
{
'text': 'So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...',
'id': 1,
'tweet_id': 343034567793442816
},
{
'text': 'RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht…',
'id': 2,
'tweet_id': 343034476395368448
}
]}
可能包含了更多的变量,但这应该是自我解释。
我已经考虑了几个选项,但大多数情况下我的问题是我的HTML表格太大了。我看到很多人推荐jQuery。考虑到桌子的大小,这对我有意义吗? 如果有一个合适的Python选项,我会非常赞成,因为到目前为止我已经用Python编写了大部分代码。
答案 0 :(得分:0)
以下是示例代码。
var tbl = $('table tr:has(td)').map(function(i, v) {
var $td = $('td', this);
return {
Id: $td.eq(0).text(),
Timestamp: $td.eq().text(),
Artist_Name: $td.eq(2).text(),
Tweet_Id: $td.eq(3).text()
Tweet: $td.eq(4).text()
User_name: $td.eq(5).text()
User_Id: $td.eq(6).text()
Followers: $td.eq(7).text()
}
}).get();
答案 1 :(得分:0)
如果你想使用jquery / javascript,这将会很难处理 - 这是可能的,但使用js处理那么多数据并不是很好。但是,如果你只做了一次,并且你真的决定使用JS - 那么在SO上就会出现类似的问题......
看看这个小提琴......
var myRows = [];
var headersText = [];
var $headers = $("th");
// Loop through grabbing everything
var $rows = $("tbody tr").each(function(index) {
$cells = $(this).find("td");
myRows[index] = {};
$cells.each(function(cellIndex) {
// Set the header text
if(headersText[cellIndex] === undefined) {
headersText[cellIndex] = $($headers[cellIndex]).text();
}
// Update the row object with the header/cell combo
myRows[index][headersText[cellIndex]] = $(this).text();
});
});
// Let's put this in the object like you want and convert to JSON (Note: jQuery will also do this for you on the Ajax request)
var myObj = {
"myrows": myRows
};
alert(JSON.stringify(myObj));
这是个问题 How to convert the following table to JSON with javascript?
如果你需要js的帮助,请问,但这应该让你去。
答案 2 :(得分:0)
使用Python和lxml package,您可以使用
解析HTMLimport lxml.html as LH
import collections
import itertools as IT
import json
Row = collections.namedtuple(
'Row',
'id timestamp artist tweet_id created_at tweet user_name user_id, followers')
filename = '/tmp/test.html'
root = LH.parse(filename)
data = []
result = {'data': data}
for row in IT.starmap(Row, zip(*[iter(root.xpath('//tr/td/text()'))]*9)):
data.append({'text':row.tweet, 'id':row.id, 'tweet_id':row.tweet_id})
with open('/tmp/test.json', 'w') as f:
json.dump(result, f, indent=2)
对于包含500K <tr>
个标签的文件大约需要50秒,并且需要大约910M RAM(将HTML加载到带有root = LH.parse(filename)
的DOM中。)
将json文件加载到Python dict中大约需要2秒钟:
In [14]: time x = json.load(open('/tmp/test.json'))
CPU times: user 1.80 s, sys: 0.04 s, total: 1.84 s
Wall time: 1.85 s