This may already have been answered; if so, please point me to that solution with a link.
What I have is a file containing details of the 100 largest countries by total area (land and water surface):
('1','Russia','17,098,242(6,601,668)','Asia/Europe','Azerbaijan, Belarus, China, Estonia, Finland, Georgia, Kazakhstan, Latvia, Lithuania, Mongolia, North Korea, Norway, Poland, Ukraine')
('2','Canada','9,984,670(3,855,100)','North America','United States')
('3','United States(incl. overseas territories)','9,857,348(3,805,943)','North America','Canada, Mexico')
('4','China','9,596,961(3,705,407)','Asia','Afghanistan, Bhutan, India, Kazakhstan, Kyrgyzstan, Laos, Mongolia, Myanmar, Nepal, North Korea, Pakistan, Russia, Tajikistan, Vietnam')
('5','Brazil','8,515,770(3,287,957)','South America','Argentina, Bolivia, Colombia, France (French Guiana), Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela'),
....
....
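Yes, the input file really does have ( and ) at the beginning and end of each line.
Any help would be much appreciated.
So far, I have tried to get the country names by writing: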
onlyCountries = 'allcountries.txt'
print([x.split(',')[1] for x in open(onlyCountries)])
But this gives me the output:
["'Russia'", "'Canada'", "'United States(incl. overseas territories)'", "'China'", "'Brazil'"...]
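Notice the extra double quotes around each name, and the parenthetical note carried over from the input file sample above? I would like to get this output instead: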
['Russia','Canada','United States','China','Brazil',....]
Answer 0 (score: 2)
You can get it with pandas like this:
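import pandas as pd
# read_html returns a list of DataFrames; take the first table on the page
df = pd.read_html("https://www.countries-ofthe-world.com/largest-countries.html", header=0, index_col=0)[0]
# drop everything from the first "(" onward; regex=True treats the pattern as a regular expression (newer pandas defaults to a literal match)
clist = df.Country.str.replace(r"\(.*", "", regex=True).tolist()
print(clist)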
Output (run under Python 2, hence the u'' prefixes):
[u'Russia', u'Canada', u'United States ', u'China', u'Brazil', u'Australia ', u'India', u'Argentina', u'Kazakhstan', u'Algeria', u'Democratic Republic of the Congo', u'Denmark ', u'Saudi Arabia', u'Mexico', u'Indonesia', u'Sudan', u'Libya', u'Iran', u'Mongolia', u'Peru', u'Chad', u'Niger', u'Angola', u'Mali', u'South Africa', u'Colombia', u'Ethiopia', u'Bolivia', u'Mauritania', u'Egypt', u'Tanzania', u'Nigeria', u'Venezuela', u'Namibia', u'Mozambique', u'Pakistan', u'Turkey', u'Chile', u'Zambia', u'Myanmar', u'Afghanistan', u'France ', u'Somalia', u'Central African Republic', u'South Sudan', u'Ukraine', u'Madagascar', u'Botswana', u'Kenya', u'Yemen', u'Thailand', u'Spain', u'Turkmenistan', u'Cameroon', u'Papua New Guinea', u'Sweden', u'Uzbekistan', u'Morocco', u'Iraq', u'Paraguay', u'Zimbabwe', u'Japan', u'Germany', u'Republic of the Congo', u'Finland ', u'Vietnam', u'Malaysia', u'Norway ', u"Cote d'Ivoire", u'Poland', u'Oman', u'Italy', u'Philippines', u'Ecuador', u'Burkina Faso', u'New Zealand ', u'Gabon', u'United Kingdom ', u'Guinea', u'Uganda', u'Ghana', u'Romania', u'Laos', u'Guyana', u'Belarus', u'Kyrgyzstan', u'Senegal', u'Syria', u'Cambodia', u'Uruguay', u'Suriname', u'Tunisia', u'Nepal', u'Bangladesh', u'Tajikistan', u'Greece', u'Nicaragua', u'North Korea', u'Malawi', u'Eritrea']
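Some names in the output above keep a trailing space (for example u'United States '), presumably because the pattern removes the parenthetical but not the space in front of it. Here is a minimal sketch of one way to avoid that, using only the standard re module. The helper name clean_name and the sample strings are assumptions made for illustration; they are not taken from the web page.

```python
import re

def clean_name(raw):
    # Remove any whitespace directly before the first "(" together with
    # everything after it, then strip leftover surrounding whitespace.
    return re.sub(r"\s*\(.*", "", raw).strip()

# Hand-written sample strings (hypothetical, for illustration only)
samples = ["United States (incl. overseas territories)", "Russia"]
cleaned = [clean_name(s) for s in samples]
print(cleaned)  # ['United States', 'Russia']
```

With pandas, the same pattern could probably be passed to Series.str.replace with regex=True, followed by .str.strip().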
Answer 1 (score: 0)
countries = []
with open('text.txt', 'r') as f:
    for line in f:
        # the second comma-separated field is the country name, still wrapped in single quotes
        country = line.split(',')[1].strip("'")
        countries.append(country)
print(countries)
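Another option, sketched here on the assumption that every data row is a valid Python tuple literal like the rows in the question, is to parse each line with ast.literal_eval instead of splitting on commas. That removes the quotes without any manual stripping, and a small regular expression drops a parenthetical such as "(incl. overseas territories)". The function name country_names and the sample list are hypothetical; the sample rows are copied from the question.

```python
import ast
import re

def country_names(lines):
    """Return the country name from each tuple-like row, skipping filler lines."""
    names = []
    for line in lines:
        text = line.strip().rstrip(',')  # some rows end with a trailing comma
        if not text.startswith('('):
            continue  # skip blank lines and '....' placeholders
        record = ast.literal_eval(text)  # parse the row into a tuple of strings
        # drop a parenthetical suffix and any surrounding whitespace
        names.append(re.sub(r"\s*\(.*", "", record[1]).strip())
    return names

sample = [
    "('2','Canada','9,984,670(3,855,100)','North America','United States')",
    "('3','United States(incl. overseas territories)','9,857,348(3,805,943)','North America','Canada, Mexico')",
    "('5','Brazil','8,515,770(3,287,957)','South America','Argentina, Bolivia, Colombia, France (French Guiana), Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela'),",
    "....",
]
print(country_names(sample))  # ['Canada', 'United States', 'Brazil']
```

To run it on the real file, pass the open file object in place of the sample list, for example country_names(f) inside a with open('allcountries.txt') as f: block.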