1. LZW基础概念

之前提到的算术编码、霍夫曼编码等技术集中在消除编码的冗余上,而本文要讲的LZW编码是一种针对空间冗余的无误差压缩方法。

LZW算法o又叫“串表压缩算法”,就是通过建立一个将字符串和其对应的记号构成的表(把已经出现过的字符串映射到记号上),用较短的代码来表示较长的字符串来实现压缩。

需要注意的是,LZW算法中字符串和记号的对应关系是在压缩的过程中动态生成的,并且隐含在压缩数据中,解压的时候也是一步一步还原编码并动态生成字典的过程。

2. LZW算法详解

2.1 LZW编码算法

编码器从字符串中不断地读入新的字符,并且将字符或字符串映射到新的记号,以动态地构建编码字典。在编码过程中,我们维护两个变量,即previous_char(表示目前已有的,但未被编码的字符)和current_char(表示刚刚读入的字符)。编码算法流程如下

  • (1)初始状态,previous_charcurrent_char都是空的。
  • (2)读入新的字符current_char,并于previous合并成新的字符串p_and_c(previous_char+current_car)
  • (3)在编码字典里查找p_and_c,如果:
    • p_and_c在字典里,previous_char = p_and_c
    • p_and_c不在字典里,将previous_char的标记输出;在字典中建立p_and_c的映射;更新previous_char = current_char
  • (4)返回步骤(2),重复直至读完字符串。

2.2 Example

ababcababac

初始状态字典里有三个默认的映射

String Symbol
a 0
b 1
c 2

开始编码

previous_char current_char p_and_c p_and_c in dict Action output
"" a a Y previous_char = a -
a b ab N 添加映射{'ab', 3} previous_char = b 0
b a ba N 添加映射{'ba', 4} 1
a b ab Y previous_char = ab -
ab c abc N 添加映射{'abc', 5} previous_char = c 3
c a ca N 添加映射{'ca', 6} previous_char = a 2
a b ab Y previous_char = ab -
ab a aba N 添加映射{'aba', 7} previsou_char = a 3
a b ab Y previous_char = ab -
ab a aba Y previous_char = aba -
aba c abac N 添加映射{'abac', 8} previous_char = c 7
c - - - - 2

编码结束,输出结果为0132372,编码过程中生成的字典为:

String Symbol
a 0
b 1
c 2
ab 3
ba 4
abc 5
ca 6
aba 7
abac 8

2.2 LZW解码算法

解码器输入的是压缩后的数据。同时,我们将维护两个变量previous_symbol(目前已有的但未被处理的记号),current_symbol(当前读入的记号)。算法流程如下

  • (1)初始状态,previous_symbolcurrent_symbol都为空。
  • (2)读入第一个记号,并赋给current_symbol,并解码直接输出。因为第一个字符为单个字符,一定可以直接解码。
  • (3)赋值,previous_symbol = current_symbol
  • (4)在字典里查找current_symbol,如果:

    • current_symbol在字典里:
    • a. 将current_symbol解码的结果输出
    • b. previous_char = dict.get(previous_symbol), current_char = dict.get(current_symbol)[0] 。注意,这里current_charcurrent_symbol解码的第一个字符。
    • c. 将p_and_c添加到新的映射

    • current_symbol不在字典里:
    • a. previous_char = dict.get(previous_symbol), current_char = dict.get(previous_symbol)[0] 。注意,这里current_charprevious_symbol解码的第一个字符。
    • b. 将p_and_c添加到新的映射
    • c. 输出p_and_c
  • 返回步骤(3),直至读完所有记号。

2.4 Example

0 1 3 2 3 7 2

初始状态字典里有三个默认的映射

Symbol String
0 a
1 b
2 c

开始解码:

首先读取第一个记号,current_symbol = 0,输出解码结果dict.get(current_symbol),即a赋值previous_symbol = current_symbol,然后循环读取之后的记号。

previous_symbol current_symbol current_symbol in dict Action output
0 1 Y previous_char = a current_char = b p_and_c = 'ab' 添加映射{3, 'ab'} b
1 3 Y previous_char = b current_char = a p_and_c = 'ba' 添加映射{4. 'ba'} ab
3 2 Y previous_char = ab current_char = c p_and_c = 'abc' 添加映射{5, 'abc'} c
2 3 Y previous_char = c current_char = a p_and_c = ca 添加映射{6. 'ca'} ab
3 7 N previous_char = ab current_char = a p_and_c = aba 添加映射{7, 'aba'} aba
7 2 Y previous_char = aba current_char = c P_and_c = abac 添加映射{8, 'abac'} c

解码结束,输出结果为ababcababac。解码过程中生成的字典为

Symbol String
0 a
1 b
2 c
3 ab
4 ba
5 abc
6 ca
7 aba
8 abac

2.5 代码实现(python)

def encode_init():
    k = 0
    char_num_dict = {}
    for i in range(97,123):
        char_num_dict[chr(i)] = k
        k += 1

    # print(char_num_dict.items())
    return char_num_dict

def decode_init():
    k = 0
    num_char_dict = {}
    for i in range(97,123):
        num_char_dict[str(k)] = chr(i)
        k += 1

    # print(char_num_dict.items())
    return num_char_dict

def LZW_encode(strList, char_num_dict):
    symbol = 26     # 标记
    previous_char = ""      # 已有的,但未被编码的字符
    current_char = ""       # 当前读入的字符
    p_and_c = ""
    output = ""             # 输出的编码序列
    for s in strList:
        current_char = s
        p_and_c = previous_char + current_char
        # print(previous_char, s)
        if char_num_dict.get(p_and_c) != None:
            previous_char = p_and_c
        else:
            output += str(char_num_dict.get(previous_char)) + '-'
            char_num_dict[p_and_c] = symbol
            symbol += 1 
            previous_char = current_char
    output += str(char_num_dict[s[-1]])

    print(char_num_dict.items())
    return output

def LZW_decode(output, num_char_dict):
    output_list = output.split('-')
    print(output_list)

    symbol = 26
    previous_symbol = ""
    current_symbol = ""

    previous_char = ""
    current_char = ""

    decodeList = ""
    decodeList += num_char_dict.get(output_list[0])
    previous_symbol = output_list[0] 

    for o in output_list[1:]:
        current_symbol = o 
        # print(previous_symbol, current_symbol)
        if num_char_dict.get(current_symbol) != None:
            decodeList += num_char_dict.get(current_symbol)

            previous_char = num_char_dict.get(previous_symbol)
            current_char = num_char_dict.get(current_symbol)[0]
            num_char_dict[str(symbol)] = previous_char + current_char
            symbol += 1 
        else:
            previous_char = num_char_dict.get(previous_symbol)
            current_char = num_char_dict.get(previous_symbol)[0]
            num_char_dict[str(symbol)] = previous_char + current_char
            symbol += 1

            decodeList += previous_char + current_char
        previous_symbol = current_symbol

    # print(num_char_dict.items())
    return decodeList

if __name__ == '__main__':
    strList = input("请输入字符序列:").lower()
    print('字符序列长度:', len(strList))
    char_num_dict = encode_init()
    output = LZW_encode(strList, char_num_dict)
    print(output)

    num_char_dict = decode_init()
    # print(num_char_dict.items())
    decodeList = LZW_decode(output, num_char_dict)
    print(decodeList)

    print("编码长度:", len(output) / 2)
Last modification:June 28th, 2020 at 11:39 am
如果觉得我的文章对你有用,请随意赞赏