I have an sqlite database full of a huge number of URLs. It takes up a huge amount of disk space, and accessing it causes many disk seeks and is slow. The average URL path length is 97 bytes (hostnames repeat a lot, so I moved them out into a foreign-keyed table). Is there any good way of compressing them? Most compression algorithms work well with big documents, not "documents" that average less than 100 bytes, but even a 20% reduction would be very useful. Any compression algorithms that would work? It doesn't have to be anything standard.
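For context, a minimal sketch (with hypothetical table and column names) of the hostname foreign-key split described above, using Python's sqlite3:

import sqlite3

# Hypothetical schema for the hostname/path split; names are assumptions.
conn = sqlite3.connect("urls.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS hosts (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE IF NOT EXISTS urls (
        id      INTEGER PRIMARY KEY,
        host_id INTEGER NOT NULL REFERENCES hosts(id),
        path    TEXT NOT NULL  -- ~97 bytes on average
    );
""")

def add_url(host, path):
    # Reuse an existing host row, or insert a new one.
    conn.execute("INSERT OR IGNORE INTO hosts (name) VALUES (?)", (host,))
    host_id = conn.execute("SELECT id FROM hosts WHERE name = ?", (host,)).fetchone()[0]
    conn.execute("INSERT INTO urls (host_id, path) VALUES (?, ?)", (host_id, path))

add_url("example.com", "/some/long/path?with=query&strings=1")
conn.commit()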
Use a compression algorithm, but with a shared dictionary.

I've done something like this before, using the LZC/LZW algorithm as used by the Unix compress command.

The trick to getting good compression with short strings is to use a dictionary made up of a standard sample of the URLs you are compressing.

You should easily get 20%.

Edit: LZC is a variant of LZW. You only require LZW, since you only need a static dictionary. LZC adds support for resetting the dictionary/table once it fills up.
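This answer is about LZW, but as an illustration of the same shared-dictionary idea with standard-library tools: DEFLATE also supports a preset dictionary, which Python 3.3+ exposes through the zdict argument of zlib.compressobj/zlib.decompressobj. A minimal sketch; the dictionary contents below are made-up assumptions, and you would build yours from a sample of your real URLs:

import zlib

# Shared dictionary built from a representative sample of the data.
# zlib matches against the *end* of the dictionary, so put the most
# common substrings last. These example fragments are assumptions.
SHARED_DICT = b"/images//index.html?id=&page=/article/2009/01/http://www."

def compress_url(url):
    c = zlib.compressobj(zlib.Z_BEST_COMPRESSION, zlib.DEFLATED,
                         zlib.MAX_WBITS, zdict=SHARED_DICT)
    return c.compress(url) + c.flush()

def decompress_url(blob):
    # The decompressor must be given the same dictionary.
    d = zlib.decompressobj(zlib.MAX_WBITS, zdict=SHARED_DICT)
    return d.decompress(blob) + d.flush()

url = b"/article/2009/01/some-post?id=42&page=2"
assert decompress_url(compress_url(url)) == url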
I've tried this using the following strategy. It uses a shared dictionary, but works around the fact that Python's zlib doesn't give you access to the dictionary itself.

First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them, throwing away the output.

Then, use copies of the trained compressor to compress each small string, and use copies of the decompressor to decompress them.

Here's my Python code (apologies for the ugly testing):
import zlib

class Trained_short_string_compressor(object):
    def __init__(self,
                 training_set,
                 bits=-zlib.MAX_WBITS,
                 compression=zlib.Z_DEFAULT_COMPRESSION,
                 scheme=zlib.DEFLATED):
        # Use a negative number of bits, so the checksum is not included.
        compressor = zlib.compressobj(compression, scheme, bits)
        decompressor = zlib.decompressobj(bits)
        junk_offset = 0
        for line in training_set:
            junk_offset += len(line)
            # run the training line through the compressor and decompressor
            junk_offset -= len(decompressor.decompress(compressor.compress(line)))

        # use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and
        # not flushing wastes space.
        junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))

        self.junk_offset = junk_offset
        self.compressor = compressor
        self.decompressor = decompressor

    def compress(self, s):
        compressor = self.compressor.copy()
        return compressor.compress(s) + compressor.flush()

    def decompress(self, s):
        decompressor = self.decompressor.copy()
        return (decompressor.decompress(s) + decompressor.flush())[self.junk_offset:]
Testing it, I saved over 30% on a bunch of 10,000 short (50 to 300 character) unicode strings. Compressing and decompressing them all also took about 6 seconds (compared to about 2 seconds with simple zlib compression/decompression). On the other hand, the simple zlib compression saved about 5%, not 30%.
import gzip

def test_compress_small_strings():
    # fname: path to a gzipped file of sample lines, one URL per line
    lines = [l for l in gzip.open(fname)]
    compressor = Trained_short_string_compressor(lines[:500])
    import time
    t = time.time()
    s = 0.0
    sc = 0.0
    for i in range(10000):
        # use an offset, so you don't cheat and compress the training set
        line = lines[1000 + i]
        cl = compressor.compress(line)
        ucl = compressor.decompress(cl)
        s += len(line)
        sc += len(cl)
        assert line == ucl
    print('compressed', i + 1, 'small strings in', time.time() - t, 'with a ratio of', sc / s)

    print('now, compare it to a naive compression')
    t = time.time()
    sc = 0.0  # reset, so the naive ratio is not mixed with the trained one
    for i in range(10000):
        line = lines[1000 + i]
        cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS)
        cl = cr.compress(line) + cr.flush()
        ucl = zlib.decompress(cl, -zlib.MAX_WBITS)
        sc += len(cl)
        assert line == ucl
    print('naive zlib compressed', i + 1, 'small strings in', time.time() - t, 'with a ratio of', sc / s)
Note that this doesn't persist if you delete the compressor. If you want persistence, you have to remember the training set.
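A minimal sketch of what that persistence might look like, assuming the Trained_short_string_compressor class above; the file name is hypothetical:

import gzip

TRAINING_FILE = "training_set.gz"  # hypothetical location for the saved sample

def save_training_set(lines):
    # Keep the exact training lines: the compressor must be rebuilt from
    # the same bytes in the same order to decode previously stored data.
    with gzip.open(TRAINING_FILE, "wb") as f:
        f.writelines(lines)

def load_compressor():
    with gzip.open(TRAINING_FILE, "rb") as f:
        return Trained_short_string_compressor(f.readlines())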