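My goal is to read a fairly large CSV file and print all the similar values. The data is about hotels, and to keep things simple I'll show it here as a list of dictionaries: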
S1 = [{'name': 'Holiday Inn A','price': '552'},
{'name': 'Holiday Inn B','price': '568'},
{'name': 'Holiday Inn C','price': '589'},
{'name': 'Grand Palace','price': '768'}
and so on...]
What I mean is that I want to print every value whose name contains "Holiday Inn". This is the result I want:
Holiday Inn A
Holiday Inn B
Holiday Inn C
Here is my code:
import csv
name = []
value = []
linked = []
a = []
def filereader():
    line_count = 0
    with open('hotelRev.csv','r', encoding ='utf-8') as fileIn:
        reader = csv.reader(fileIn)
        for row in reader:
            line_count = line_count + 1
            if line_count == 1:
                name.append(row)
            else:
                value.append(row)
    for x in name:
        for y in value:
            linked.append(dict(zip(x,y)))
filereader()
for row in linked:
    a.append(row['name'])
b = sorted(set(a))
for row in linked:
    print(row['name']['Holiday Inn'])
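Obviously this doesn't work. Does anyone know how to do this?
Edit 1: By "similar" I mean grouping all the Holiday Inn entries into one large category so that they are easier to call and print. A direct example from the dataset itself: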
Holiday Inn Express & Suites Austin South
Holiday Inn Express & Suites Baton Rouge East
Holiday Inn Express & Suites Bethlehem
Holiday Inn Express & Suites Bloomington
Holiday Inn Express & Suites Butte
Holiday Inn Express & Suites Carmel-north Indianapolis
Holiday Inn Express & Suites Carpinteria
Holiday Inn Express & Suites Columbus - Polaris Parkway
Holiday Inn Express & Suites Columbus Univ Area - Osu
Holiday Inn Express & Suites Denver Northeast - Brighton
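If possible, I'd love to find a way to print these out in as few lines as possible.
Answer 0 (score: 1)
This is a basic solution using sets. I don't think it is efficient for very large datasets, but it can serve as a reference for building an efficient solution.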
import pandas as pd
import re

df = pd.read_csv('HotelNames.csv')
search_terms = input('Enter search terms: ')
# Convert to lower case
search_terms = search_terms.lower()
# Replace special characters with spaces
search_terms = re.sub(r"[^a-zA-Z0-9]+", ' ', search_terms)
# Make a set of unique words; split() with no argument drops empty strings
search_set = set(search_terms.split())
for i in range(len(df)):
    # iloc[i, 0] reads the first column of row i by position
    t = re.sub(r"[^a-zA-Z0-9]+", ' ', df.iloc[i, 0])
    hotel_set = set(t.lower().split())
    # Find whether the searched terms are a subset of the hotel name in that particular row
    if search_set.issubset(hotel_set):
        print(df.iloc[i, 0])
HotelNames.csv currently contains one column, the hotel name.
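For comparison, the same word-subset idea can be sketched with only the standard library. This is an illustration under assumptions, not a replacement for the code above: the helper names `words` and `matching_names` are made up here, the column is assumed to be called `name` as in the question, and an in-memory sample stands in for the real file.

```python
import csv
import io
import re

def words(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_names(rows, search_terms, column="name"):
    """Return, in input order, the names whose words include every search word."""
    wanted = words(search_terms)
    return [row[column] for row in rows if wanted <= words(row[column])]

# Hypothetical sample standing in for the real CSV file
sample = io.StringIO(
    "name,price\n"
    "Holiday Inn A,552\n"
    "Holiday Inn B,568\n"
    "Holiday Inn C,589\n"
    "Grand Palace,768\n"
)
result = matching_names(csv.DictReader(sample), "Holiday Inn")
print("\n".join(result))
```

To run it against a real file, pass `csv.DictReader(open('hotelRev.csv', encoding='utf-8'))` in place of the sample, assuming the file has a header row with a `name` column.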
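Answer 1 (score: 0)
I think what's missing is your exact definition of "similar". I suggest writing a function or method that returns the boolean True when two strings match your definition of similar. Once that is solved, the rest should just be an if statement.
Some test strings for you to consider (you decide whether they are similar, and why):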
"Holiday Inn" "Holiday In" "Holiday Hotel" "Holiday Hotel" " Holiday Hotel" "Holiday Inn" "^*$%__Holiday Inn!" "Holiday Inn Suites San Francisco" and so on
One thing you may want to learn about and get familiar with is the concept of phonetic distance. Here is a Python library for it: https://github.com/jamesturk/jellyfish
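As a starting point for such a function, here is a minimal sketch that uses only the standard library. Note that it is not phonetic distance: it normalises both strings, checks whole-word containment, and otherwise falls back on `difflib.SequenceMatcher`, which gives an edit-based similarity ratio. The names `normalize` and `is_similar` and the 0.8 threshold are assumptions chosen for illustration; your own definition of "similar" may well differ.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lower-case, replace runs of non-alphanumerics with one space, and trim."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def is_similar(query, name, threshold=0.8):
    """Return True if `name` counts as similar to `query` under this sketch's definition."""
    q, n = normalize(query), normalize(name)
    if not q:
        return False
    # Whole-word containment: "holiday inn" matches "holiday inn express ..."
    if f" {q} " in f" {n} ":
        return True
    # Otherwise tolerate small spelling differences (the threshold is an arbitrary assumption)
    return SequenceMatcher(None, q, n).ratio() >= threshold

for candidate in ["^*$%__Holiday Inn!", "holiday in", "Grand Palace"]:
    print(repr(candidate), is_similar("Holiday Inn", candidate))
```

With a function like this in place, the filtering step becomes a single `if is_similar(...)` inside the loop over rows.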
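Answer 2 (score: 0)
Based on your comment, it sounds like your data is actually fairly structured (although some entries may need a little editing to clean them up). One approach you could take is to look at groups of names that share a common prefix.
I'll use words instead of characters, and a dict plus helper functions instead of a class, to keep it as simple as possible.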
ITEMS_KEY = None  # Anything that isn't a string is safe here

def add_item(lookup, item):
    words = item.split()
    # e.g.: ["Holiday", "Inn", "Express", "&", "Suites", "Austin", "South"]
    for word in words:
        # Add a lower level lookup if needed
        # Making this word.lower() makes searches case-insensitive
        lookup = lookup.setdefault(word, {})
    lookup.setdefault(ITEMS_KEY, set()).add(item)  # Add the full item
    # e.g.:
    # lookup = {"Holiday": {"Inn": {"Express": ... {"South": {None: {"Holiday Inn Express ..."}}}}}}

def get_items_matching_prefix(lookup, prefix):
    # Simple version with full words only
    # Find the tree of results
    words = prefix.split()
    for word in words:
        lookup = lookup.get(word, {})
    return all_values(lookup)

def all_values(lookup):
    # Collect all the results
    ret = set()
    for k, v in lookup.items():  # items() replaces Python 2's iteritems()
        if k == ITEMS_KEY:
            ret.update(v)
        else:
            ret.update(all_values(v))
    return ret

csv_data = [
    "Holiday Inn Express & Suites Austin South",
    "Holiday Inn Express & Suites Austin North",
    "Holiday Inn Oriental Express",
]
lookup = {}  # Could be a class or recursive collections.defaultdict
for row in csv_data:
    # e.g. row = "Holiday Inn Express & Suites Austin South"
    add_item(lookup, row)

# Sets are unordered, so sort the results for stable output
print(sorted(get_items_matching_prefix(lookup, "Holiday Inn")))
# ['Holiday Inn Express & Suites Austin North', 'Holiday Inn Express & Suites Austin South', 'Holiday Inn Oriental Express']
print(sorted(get_items_matching_prefix(lookup, "Holiday Inn Express")))
# ['Holiday Inn Express & Suites Austin North', 'Holiday Inn Express & Suites Austin South']
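A more advanced technique might be to search for a set of common substrings, decide which of them are hotels, then parse your CSV into "hotel + metadata", add a chain -> hotels map, and use that richer data instead.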
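A rough sketch of what the chain -> hotels map could look like is shown here. It skips the substring-discovery step entirely and assumes the chain names are already known: `KNOWN_CHAINS`, `split_chain` and `build_chain_map` are hypothetical names invented for this illustration, and "Grand Palace" is borrowed from the question as an unrecognised entry.

```python
from collections import defaultdict

# Hypothetical chain names; in practice these might come from counting common prefixes
KNOWN_CHAINS = ["Holiday Inn", "Holiday Inn Express"]

def split_chain(name, chains=KNOWN_CHAINS):
    """Return (chain, remainder) using the longest chain name that prefixes `name`."""
    for chain in sorted(chains, key=len, reverse=True):
        if name == chain or name.startswith(chain + " "):
            return chain, name[len(chain):].strip()
    return None, name

def build_chain_map(names):
    """Map each chain (None for unrecognised names) to a list of (name, remainder)."""
    chain_to_hotels = defaultdict(list)
    for name in names:
        chain, remainder = split_chain(name)
        chain_to_hotels[chain].append((name, remainder))
    return chain_to_hotels

names = [
    "Holiday Inn Express & Suites Austin South",
    "Holiday Inn Express & Suites Austin North",
    "Holiday Inn Oriental Express",
    "Grand Palace",
]
chain_map = build_chain_map(names)
for name, remainder in chain_map["Holiday Inn Express"]:
    print(name)
```

Trying the longest chain name first keeps "Holiday Inn Express" entries from being absorbed into plain "Holiday Inn". The remainder string is where metadata such as the location could later be parsed from.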