What is a Bloom filter?
Bloom filter is a kind of random data structure with high space efficiency, which is specially used to detect whether there are specific elements in the set. The Bloom filter is a data structure composed of a bit array with a length of m bits and k independent hash functions. The bit array is initialized to 0, and all hash functions can respectively hash the input data as evenly as possible. When you want to insert an element into the Bloom filter, the element is calculated by k hash functions to generate k hash values, and the hash value is used as the subscript in the bit array, and all k corresponding bit values are calculated by 0 is set to 1. When an element is to be queried, it is also calculated through a hash function to generate a hash value, and then the corresponding k bit value is checked: if any bit is 0, it means that the element must not be in the set; if all bits are 1, indicating that the element may be in the set
Due to the possibility of hash collisions, the hash values calculated by different elements may be the same, resulting in a non-existent element may correspond to a bit of 1, which is the so-called "false positive" (false positive). In contrast, "false negative" (false negative) will never appear in BF. Therefore, Bloom Filter is not suitable for those "zero error" applications. In applications where a low error rate can be tolerated, Bloom Filter exchanges very few errors for a great savings in storage space.
Therefore, what the Bloom filter thinks is not in the collection; what the Bloom filter thinks is not necessarily in the collection
Algorithm
- Select k hash functions, denoted as {h1,h2,...,hk}
- Suppose there are n elements that need to be mapped into the bit array, and the length of the bit array is m. Initially, the element of each position of the m-bit bit array is set to 0
- Now, map the n elements to the position of the bit array using the k hash functions selected in step 1, and the element of the position where the bit array is mapped becomes 1. Obviously, an element can be mapped to k positions. The process is shown in the figure, now the element set {x,y,z} is mapped to a binary array through 3 hash functions
- Finally, when you need to check whether an element is in the existing set, use these k hash functions to map the element to be judged to the position of the bit array, as long as the bit array is mapped to a bit in the bit array If it is not 1, it must indicate that this element is not in the existing set. As shown in the figure, when checking whether w is in the set, a hash function maps ww to the position where the element of the bit array is 0.
PHP + Redis implementation
<?php
/**
* BitSet simulates BitSet in PHP, you can use Array instead
*/
class BitSet {
protected $bit = [];
public function add($index) {
$this->bit[$index] = 1;
}
public function has($index) {
if(isset($this->bit[$index])) {
return true;
}
return false;
}
}
/**
* Hash algorithm
*/
class HashFunction
{
protected $size;
protected $seed;
public function __construct($size, $seed)
{
$this->size = $size;
$this->seed = $seed;
}
public function hash($value)
{
$r = 0;
$l = strlen($value);
for ($i = 0; $i <$l; $i++) {
$r = $this->seed * $r + ord($value[$i]);
}
return ($this->size-1) & $r;
}
}
/**
* Bloom filter
*/
class BloomFilter
{
/** @var int A bit length of 1 billion */
protected $size = 256 << 22;
/** @var int[] In order to reduce the error rate, the additive hash algorithm is used, so an 8-element prime number array is defined */
protected $seeds = [3, 5, 7, 11, 13, 31, 37, 61];
/** @var HashFunction[] is equivalent to building 8 different hash algorithms */
protected $functions = [];
/** @var BitSet[] Initialize the bitmap of the Bloom filter */
protected $bitset = [];
public function __construct($size = null, $seeds = null)
{
if($seeds != null) {
$this->size = $size;
}
if($seeds != null) {
$this->seeds = $seeds;
}
foreach ($this->seeds as $v) {
$this->functions[$v] = new HashFunction($this->size, $v);
$this->bitset[$v] = new BitSet();
}
}
/**
* @param string $str
*/
public function add($str) {
if($str != null) {
foreach ($this->functions as $k => $fun) {
$this->bitset[$k]->add($fun->hash($str));
}
}
}
/**
* @param string $str
* @return bool
*/
public function has($str) {
if($str != null) {
foreach ($this->functions as $k => $fun) {
if(!$this->bitset[$k]->has($fun->hash($str))) {
return false;
}
}
return true;
}
return false;
}
}
function d($str) {
if(is_array($str)) {
echo "[Array]:\n";
print_r($str);
echo "\n";
} else if(is_object($str)) {
echo "[Object]:\n";
print_r($str);
echo "\n";
} else if(is_bool($str)) {
echo "[Boolean]: ". ($str? "true": "false"). "\n";
} else {
echo "[String]: ". $str. "\n";
}
}
/*
* Test function
*/
d("Initialize Bloom Filter");
$f = new BloomFilter;
$l1 = 10000;
$l2 = 10;
d("Add initial data");
for ($i = 0; $i <$l1; $i ++) {
$f->add("#". $i);
// d("add #{$i}");
}
d("Testing and judging random numbers {$l2}");
for ($i = 0; $i <$l2; $i ++) {
// $s = "#". rand($l2, $l2 + 1000);
// $s = "#0";
$s = "#". rand(0, $l1 * 2);
$r = $f->has($s);
d("Judging number {$s} >> ". ($r? "true": "false"));
}
Post comment 取消回复