Question

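I have a script that extracts information from Excel into a list of str values. The values are phrases such as "I like cooking" and "My dog´s name is Doug".

I tried this code that I found on the Internet, having read that the int function can convert actual phrases into numbers.

The code I used is: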

lista=["I like cooking", "My dog´s name is Doug", "Hi, there"]

test_list = [int(i, 36) for i in lista]

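When I run the code, I get the following error:

builtins.ValueError: invalid literal for int() with base 36: 'I like cooking'

When I tried the code on strings without spaces or punctuation, I did get actual values, but I really need to take those characters into account.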

Answer 1

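To extend the bytearray approach, you can use int.to_bytes and int.from_bytes to actually get an int back, although the integers will be much longer than the ones shown in your example.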

def to_int(s):
    return int.from_bytes(bytearray(s, 'utf-8'), 'big', signed=False)

def to_str(n):
    # n is the integer produced by to_int; round up to a whole number of bytes
    return n.to_bytes((n.bit_length() + 7) // 8, 'big').decode('utf-8')

lista = ["I like cooking",
            "My dog´s name is Doug",
            "Hi, there"]

encoded = [to_int(s) for s in lista]

decoded = [to_str(s) for s in encoded]

Encoded:

[1483184754092458833204681315544679,
 28986146900667755422058678317652141643897566145770855,
 1335744041264385192549]

Decoded:

['I like cooking',
 'My dog´s name is Doug',
 'Hi, there']
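As a quick sanity check of the round trip, here is a minimal sketch that re-declares the two helpers so it runs on its own. It also shows one caveat that is my own addition rather than something the approach above promises to handle: a string that starts with NUL characters loses them, because leading zero bytes do not change the value of the integer.

```python
def to_int(s):
    # Interpret the UTF-8 bytes of s as one big-endian unsigned integer
    return int.from_bytes(s.encode('utf-8'), 'big', signed=False)

def to_str(n):
    # Use the smallest number of bytes that can hold n, then decode as UTF-8
    return n.to_bytes((n.bit_length() + 7) // 8, 'big').decode('utf-8')

phrases = ["I like cooking", "My dog´s name is Doug", "Hi, there"]
print([to_str(to_int(p)) for p in phrases] == phrases)  # True

# Caveat: leading NUL characters are dropped on the way back
print(repr(to_str(to_int("\x00abc"))))  # 'abc'
```

If that edge case matters for your data, you would need to store the original length alongside the integer.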

Answer 2

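As noted in the comments, you can't use int() to convert a phrase to an integer if the phrase contains spaces or, with a few exceptions, most non-alphanumeric characters.

If all of your phrases use a common encoding, you may get closer to what you want by converting the strings to byte arrays. For example: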
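To see the restriction concretely, here is a minimal sketch (the helper name `try_base36` is made up for illustration). In base 36 the valid digits are 0-9 and a-z, so a single alphanumeric word parses but a phrase with inner spaces or punctuation does not:

```python
def try_base36(text):
    """Return int(text, 36), or None if text is not a valid base-36 literal."""
    try:
        return int(text, 36)
    except ValueError:
        return None

print(try_base36("cooking") is not None)  # True: every character is a base-36 digit
print(try_base36("I like cooking"))       # None: inner spaces are not digits
print(try_base36("Hi, there"))            # None: neither is the comma

# Even when parsing succeeds, letter case is ignored, so the text is not recoverable
print(try_base36("Doug") == try_base36("doug"))  # True
```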

s = 'My dog´s name is Doug'

b = bytearray(s, 'utf-8')
print(list(b))
# [77, 121, 32, 100, 111, 103, 194, 180, 115, 32, 110, 97, 109, 101, 32, 105, 115, 32, 68, 111, 117, 103]

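From there, you will have to decide whether to keep a list of integers for each phrase or to combine them in some way, depending on how you intend to use these numeric representations of the strings.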
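For instance, if you keep one list of byte values per phrase, getting back to the text is straightforward. A minimal sketch, assuming UTF-8 throughout (the function names `encode_phrases` and `decode_phrases` are only illustrative):

```python
def encode_phrases(phrases, encoding='utf-8'):
    """Turn each phrase into a list of byte values (ints from 0 to 255)."""
    return [list(p.encode(encoding)) for p in phrases]

def decode_phrases(encoded, encoding='utf-8'):
    """Rebuild the phrases from their lists of byte values."""
    return [bytes(values).decode(encoding) for values in encoded]

lista = ["I like cooking", "My dog´s name is Doug", "Hi, there"]
as_ints = encode_phrases(lista)

print(as_ints[2])                        # [72, 105, 44, 32, 116, 104, 101, 114, 101]
print(decode_phrases(as_ints) == lista)  # True
```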

Answer 3

Since you want to convert the text for use with an AI model, you should do something like this:

import re

def clean_text(text, vocab):
    '''
    normalizes the string
    '''
    chars = {'\'':[u"\u0060", u"\u00B4", u"\u2018", u"\u2019"], 'a':[u"\u00C0", u"\u00C1", u"\u00C2", u"\u00C3", u"\u00C4", u"\u00C5", u"\u00E0", u"\u00E1", u"\u00E2", u"\u00E3", u"\u00E4", u"\u00E5"],
                'e':[u"\u00C8", u"\u00C9", u"\u00CA", u"\u00CB", u"\u00E8", u"\u00E9", u"\u00EA", u"\u00EB"],
                'i':[u"\u00CC", u"\u00CD", u"\u00CE", u"\u00CF", u"\u00EC", u"\u00ED", u"\u00EE", u"\u00EF"],
                'o':[u"\u00D2", u"\u00D3", u"\u00D4", u"\u00D5", u"\u00D6", u"\u00F2", u"\u00F3", u"\u00F4", u"\u00F5", u"\u00F6"],
                'u':[u"\u00D9", u"\u00DA", u"\u00DB", u"\u00DC", u"\u00F9", u"\u00FA", u"\u00FB", u"\u00FC"],
                'y':[u"\u00DD", u"\u00FD"]}  # U+00DD/U+00FD are accented y, not u

    for gud in chars:
        for bad in chars[gud]:
            text = text.replace(bad, gud)

    if 'http' in text:
        return ''

    text = text.replace('&', ' and ')
    text = re.sub(r'\.( +\.)+', '..', text)
    #text = re.sub(r'\.\.+', ' ^ ', text)
    text = re.sub(r',+', ',', text)
    text = re.sub(r'\-+', '-', text)
    text = re.sub(r'\?+', ' ? ', text)
    text = re.sub(r'\!+', ' ! ', text)
    text = re.sub(r'\'+', "'", text)
    text = re.sub(r';+', ':', text)
    text = re.sub(r'/+', ' / ', text)
    text = re.sub(r'<+', ' < ', text)
    text = re.sub(r'>+', ' > ', text)
    text = text.replace('%', '% ')
    text = text.replace(' - ', ' : ')
    text = text.replace(' -', " - ")
    text = text.replace('- ', " - ")
    text = text.replace(" '", " ")
    text = text.replace("' ", " ")

    #for c in ".,:":
    #   text = text.replace(c + ' ', ' ' + c + ' ')

    text = re.sub(r' +', ' ', text.strip(' '))

    for i in text:
        if i not in vocab:
            text = text.replace(i, '')

    return text

def arr_to_vocab(arr, vocabDict):
    '''
    returns a provided array converted with provided vocab dict, all array elements have to be in the vocab, but not all vocab elements have to be in the input array, works with strings too
    '''
    try:
        return [vocabDict[i] for i in arr]

    except Exception as e:
        print (e)
        return []

def str_to_vocab(vocab):
    '''
    generates vocab dicts 
    '''
    to_vocab = {}
    from_vocab = {}

    for index, i in enumerate(vocab):
        to_vocab[index] = i
        from_vocab[i] = index

    return to_vocab, from_vocab

vocab = sorted([chr(i) for i in range(32, 127)]) # a basic vocab for your model
vocab.insert(0, None)

toVocab, fromVocab = str_to_vocab(vocab) #converting vocab into usable form

your_data_str = ["I like cooking", "My dog´s name is Doug", "Hi, there"] #your data, a list of strings

X = []

for i in your_data_str:
    X.append(arr_to_vocab(clean_text(i, vocab), fromVocab)) # normalizing and converting to "ints" each string

# your data is now almost ready for your model, just pad it to the size of your input with zeros and it's done

print (X)

If you want to know how to convert the lists of "ints" back into strings, let me know.
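In case it helps in the meantime, here is a rough sketch of the reverse direction. It rebuilds the same vocab so that it runs on its own; `vocab_to_str`, `index_to_char` and `char_to_index` are names I made up, and it assumes index 0 (the `None` entry) is only ever used as padding.

```python
# Same vocab as above: index 0 is None (padding), then printable ASCII 32..126
vocab = [None] + [chr(i) for i in range(32, 127)]
index_to_char = dict(enumerate(vocab))
char_to_index = {c: i for i, c in index_to_char.items()}

def vocab_to_str(indices, index_to_char):
    """Map a list of vocab indices back to text, skipping the padding entry."""
    chars = (index_to_char[i] for i in indices)
    return ''.join(c for c in chars if c is not None)

# Encode a phrase and pad it with zeros, as you would before feeding a model
encoded = [char_to_index[c] for c in "Hi, there"] + [0, 0, 0]
print(vocab_to_str(encoded, index_to_char))  # Hi, there
```

Keep in mind that the cleaning step is lossy: characters that were normalized or dropped because they are not in the vocab cannot be recovered this way.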

How do I convert a list of str containing phrases into a list of int?
