Hamming Distance: A Way to Speed Up Queries over Massive Data Sets - 高飞网

2017-07-28 02:09:46

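    In the three image-recognition hash algorithms discussed earlier, the last step is always to compare how similar two hashes are, that is, to compute their Hamming distance. Comparing two hash values this way is very fast, but it does not hold up against explosive data growth: finding similar hashes in a massive data set gets steadily slower, and a plain sequential scan clearly cannot scale to hundreds of millions of records.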
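
    As a reminder of why a single comparison is cheap, here is a minimal sketch of the Hamming distance between two 64-bit fingerprints. The class name HammingSketch is an assumption made for illustration; it is not the HammingDistance class used later in this article.

```java
// Minimal sketch: Hamming distance between two 64-bit fingerprints.
// HammingSketch is a hypothetical name, not the article's HammingDistance class.
final class HammingSketch {

    /** Counts the bit positions in which a and b differ. */
    static int distance(long a, long b) {
        // XOR leaves a 1 exactly where the two values disagree;
        // Long.bitCount counts those 1 bits.
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = 0b1011L;
        long b = 0b0010L;
        // 1011 XOR 0010 = 1001, which has two 1 bits.
        System.out.println(distance(a, b)); // prints 2
    }
}
```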

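    In image recognition, two images are considered similar when the Hamming distance is between 0 and 10. With sequential search, every query has to be compared against every stored hash: 10,000 comparisons for a collection of 10,000 images.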
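
    The sequential search described above might look roughly like the sketch below. The class name, method name and parameters are assumptions for illustration; the point is only that the loop touches every stored hash once per query.

```java
import java.util.ArrayList;
import java.util.List;

// Baseline sketch: brute-force scan over all stored fingerprints.
// LinearScanSketch and scan(...) are hypothetical names used for illustration.
final class LinearScanSketch {

    /** Returns every stored fingerprint within maxDistance bits of the query. */
    static List<Long> scan(List<Long> stored, long query, int maxDistance) {
        List<Long> matches = new ArrayList<>();
        for (long candidate : stored) { // one comparison per stored hash
            if (Long.bitCount(candidate ^ query) <= maxDistance) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```

    The cost grows linearly with the number of stored hashes, which is exactly what the index in the next section tries to avoid.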

How the algorithm works

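    Take "similar" to mean a Hamming distance of 10 or less. If each hash is split into 11 parts, then two similar hashes must have at least one part that is exactly the same, because at most 10 differing bits cannot touch all 11 parts. Splitting a 64-bit hash into 11 parts is awkward, though, since 64 is not divisible by 11. Instead we can split the hash into 8 parts: if every part differs, the Hamming distance is at least 8; conversely, if the Hamming distance is less than 8, at least one part is identical.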
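
    The pigeonhole argument can be checked on a concrete value. The sketch below (PigeonholeSketch and identicalBlocks are hypothetical names) counts how many of the eight 8-bit blocks two hashes share, first for a pair at distance 7 and then for a pair at distance 8.

```java
// Sketch of the pigeonhole argument for 8 blocks of 8 bits.
// PigeonholeSketch and identicalBlocks are names assumed for illustration.
final class PigeonholeSketch {

    /** Returns how many of the eight 8-bit blocks are identical in a and b. */
    static int identicalBlocks(long a, long b) {
        long diff = a ^ b;
        int same = 0;
        for (int block = 0; block < 8; block++) {
            // A block is identical when all 8 of its bits are 0 in the XOR.
            if (((diff >>> (block * 8)) & 0xFF) == 0) {
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        long a = 0x0123456789ABCDEFL;
        // Flip one bit in each of the 7 low blocks: distance 7, top block untouched.
        long b = a ^ 0x0001010101010101L;
        System.out.println(Long.bitCount(a ^ b));  // prints 7
        System.out.println(identicalBlocks(a, b)); // prints 1
        // Flip one bit in every block: distance 8, no identical block left.
        long c = a ^ 0x0101010101010101L;
        System.out.println(identicalBlocks(a, c)); // prints 0
    }
}
```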

    Based on this principle, an index over a massive data set can be built with the following steps:

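  1. Split the 64-bit hash into 8 equal blocks of 8 bits.
  2. Rotate the 64-bit hash so that each block in turn becomes the leading 8 bits; this gives 8 tables in total, one per block.
  3. Look up the leading 8 bits in the corresponding table by exact match.
  4. If there is a hit, compute the exact Hamming distance against the hashes stored under that entry.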
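
    The steps above can be sketched compactly as follows. This is a simplified illustration under stated assumptions, not the implementation from this article: the names BlockIndexSketch, block, add and search are made up, and instead of rotating the hash it keys table i directly on block i, which selects the same candidates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of an 8-table block index over 64-bit hashes (hypothetical names throughout).
final class BlockIndexSketch {
    private static final int BLOCKS = 8;
    // tables.get(i) maps the value of block i to the hashes containing that value.
    private final List<Map<Integer, List<Long>>> tables = new ArrayList<>();

    BlockIndexSketch() {
        for (int i = 0; i < BLOCKS; i++) {
            tables.add(new HashMap<>());
        }
    }

    /** Extracts block i (0 = most significant 8 bits) of the hash. */
    static int block(long hash, int i) {
        return (int) ((hash >>> (56 - 8 * i)) & 0xFF);
    }

    /** Steps 1 and 2: register the hash in each of the 8 tables. */
    void add(long hash) {
        for (int i = 0; i < BLOCKS; i++) {
            tables.get(i).computeIfAbsent(block(hash, i), k -> new ArrayList<>()).add(hash);
        }
    }

    /** Steps 3 and 4: exact match on one block, then the exact distance check. */
    Set<Long> search(long query, int maxDistance) {
        Set<Long> result = new LinkedHashSet<>(); // a set, so each hash is reported once
        for (int i = 0; i < BLOCKS; i++) {
            List<Long> candidates = tables.get(i).get(block(query, i));
            if (candidates == null) {
                continue;
            }
            for (long candidate : candidates) {
                if (Long.bitCount(candidate ^ query) <= maxDistance) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }
}
```

    With 8 blocks, a search with maxDistance of 7 or less cannot miss a stored hash; for larger thresholds a match is found only if at least one block happens to be identical.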

Java implementation


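import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// DHash, HammingDistance and Top are not defined in this listing; they are
// presumably helper classes from the earlier articles in this series.
public class DIndex implements Serializable {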

    private static final long serialVersionUID = -4463444087393922139L;
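    // One table per block: block value -> fingerprints that have this value in that block.
    List<Map<Integer, List<Long>>> index_store = new ArrayList<Map<Integer, List<Long>>>();
    // Fingerprint -> image path.
    Map<Long, String> data = new HashMap<Long, String>();
    static final int STORE_COUNT = 8;
    // Search threshold. Note that the 8-block index only guarantees a hit for distances
    // below 8; a farther match is found only if some block happens to be identical.
    static final int MAX_DIS = 25;
    private String indexPath = "";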

    /**
     * @param indexPath
     *            path where the index is saved
     */
    public DIndex(String indexPath) {
        this.indexPath = indexPath;
        File file = new File(indexPath);
        if (!file.exists()) {
            try {
                file.createNewFile();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // Initialize the 8 index tables
        for (int i = 0; i < STORE_COUNT; i++) {
            index_store.add(new HashMap<Integer, List<Long>>());
        }
    }

    public void index(String image) throws IOException {
        long fingerprint = DHash.fingerprint(image);
        intoIndex(fingerprint);
        data.put(fingerprint, image);
    }

    public void intoIndex(Long fingerprint) {
        int subHash[] = subHash(fingerprint);
        for (int i = 0; i < STORE_COUNT; i++) {
            int hash = subHash[i];
            Map<Integer, List<Long>> map = index_store.get(i);
            intoIndex(hash, fingerprint, map);
        }
    }

    public void intoIndex(Integer key, Long value, Map<Integer, List<Long>> index) {
        List<Long> list = index.get(key);
        if (list == null) {
            list = new ArrayList<Long>();
        }
        list.add(value);
        index.put(key, list);
    }

    public Top<String, Integer> search(String image) throws IOException {
        long fingerprint = DHash.fingerprint(image);
        return search(fingerprint);
    }

    public Top<String, Integer> search(long finger0) throws IOException {
        int subHash[] = subHash(finger0);
        Top<String, Integer> top = new Top<String, Integer>();
        // Fingerprints already checked, so a match found in several tables is added once.
        Set<Long> seen = new HashSet<Long>();
        for (int i = 0; i < STORE_COUNT; i++) {
            // Block i of the query is only meaningful in table i.
            List<Long> fingers = index_store.get(i).get(subHash[i]);
            if (fingers == null) {
                continue;
            }
            for (Long finger : fingers) {
                if (!seen.add(finger)) {
                    continue;
                }
                int dis = HammingDistance.distance(finger0, finger);
                if (dis < MAX_DIS) {
                    String file = data.get(finger);
                    top.add(file, dis);
                }
            }
        }
        return top;
    }

    // Splits the 64-bit fingerprint into 8 blocks of 8 bits, most significant block first.
    public int[] subHash(long fingerprint) {
        int[] subHash = new int[STORE_COUNT];
        for (int i = 0; i < STORE_COUNT; i++) {
            int shift = 64 - 8 * (i + 1); // 56, 48, ..., 0
            subHash[i] = (int) ((fingerprint >>> shift) & 0xff);
        }
        return subHash;
    }

    // Brute-force scan over every stored fingerprint, useful as a baseline for the index.
    public Top<String, Integer> fullSearch(String toFind) throws IOException {
        long fingerprint = DHash.fingerprint(toFind);
        Top<String, Integer> top = new Top<String, Integer>();
        for (Long f : data.keySet()) {
            int dis = HammingDistance.distance(fingerprint, f);
            if (dis < MAX_DIS) {
                top.add(data.get(f), dis);
            }
        }
        return top;
    }

    public void write() throws IOException {
        // try-with-resources closes the streams even if writeObject throws.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(indexPath))) {
            out.writeObject(this);
        }
    }

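    public boolean canReload() {
        // The constructor creates an empty file, so also require some serialized content.
        File file = new File(indexPath);
        return file.exists() && file.length() > 0;
    }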

    public DIndex reload() {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(indexPath))) {
            return (DIndex) in.readObject();
        } catch (Exception e) {
            // Also reached for the empty file created by the constructor.
            e.printStackTrace();
        }
        return null;
    }
}

