Trying to isolate one column with Beautiful Soup

Asked: 2015-03-08 22:35:39

Tags: python beautifulsoup html-parsing wikipedia

I am trying to isolate the Location column and eventually write it out to a database file. My code is as follows:

import urllib
import urllib2
from bs4 import BeautifulSoup


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

trs = soup.find_all('td')

for tr in trs:
  for link in tr.find_all('a'):
    fulllink = link.get ('href')

tds = tr.find_all("tr")
location = str(tds[3].get_text())



print location

But I always get one of two results: either a "list index out of range" error or the script just exits with code 0. I'm not sure about BeautifulSoup since I'm still learning it, so any help is appreciated. Thanks!

2 Answers:

Answer 0 (score: 2)

There is a simpler way to find the Location column: use the table.wikitable tr CSS selector, find all td elements in each row, and take the 4th td by index.

Also, if a cell contains multiple locations, you need to handle them separately:

import urllib2
from bs4 import BeautifulSoup


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
soup = BeautifulSoup(urllib2.urlopen(url))

for row in soup.select('table.wikitable tr'):
    cells = row.find_all('td')
    if cells:
        # the Location column is the 4th cell; print each text node separately
        for text in cells[3].find_all(text=True):
            text = text.strip()
            if text:
                print text

Prints:

Afghanistan
Nigeria
Cameroon
Niger
Chad
...
Iran
Nigeria
Mozambique
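
Since the question mentions eventually writing the column to a database file, here is a minimal sketch that stores the same values in a local SQLite file. The conflicts.db filename and the locations table are placeholder names, not anything defined by the page or by BeautifulSoup:

import sqlite3
import urllib2
from bs4 import BeautifulSoup


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
soup = BeautifulSoup(urllib2.urlopen(url))

conn = sqlite3.connect('conflicts.db')  # placeholder database file
conn.execute('CREATE TABLE IF NOT EXISTS locations (name TEXT)')

for row in soup.select('table.wikitable tr'):
    cells = row.find_all('td')
    if cells:
        for text in cells[3].find_all(text=True):
            text = text.strip()
            if text:
                # insert each location text node as one row
                conn.execute('INSERT INTO locations (name) VALUES (?)', (text,))

conn.commit()
conn.close()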

Answer 1 (score: 0)

You just have to swap td and tr in your code. Also be careful with the str() function, because the page may contain unicode strings that cannot be converted with a plain ASCII string. Your code should be:

import urllib
import urllib2
from bs4 import BeautifulSoup


url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

trs = soup.find_all('tr')  # 'tr' instead of 'td'

for tr in trs:
    for link in tr.find_all('a'):
        fulllink = link.get('href')
        tds = tr.find_all("td")  # 'td' instead of 'tr'
        if len(tds) > 3:  # skip header rows and short rows with no 4th cell
            location = tds[3].get_text()  # str() removed, keep the unicode string
            print location
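
For example, a minimal sketch of encoding one of these unicode values before writing raw bytes out; the sample string and the locations.txt filename are placeholders, not taken from the page:

# get_text() returns unicode, so encode it explicitly instead of calling str()
location = u"C\u00f4te d'Ivoire"  # placeholder unicode value
with open('locations.txt', 'a') as f:  # placeholder output file
    f.write(location.encode('utf-8') + '\n')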

voilà!!