2815 words
14 minutes
260312_bit_pattern_training_BitHacks_SWAR

link#


2진수 이쁘게 출력하기|🔝|#

fn format_bin8(x: u8) -> String {
    let s = format!("{:08b}", x);
    format!("0b{} {}", &s[0..4], &s[4..8])
}

fn main() {
    let x: u8 = 4;
    println!("{}", format_bin8(x));
}
  • result
0b0000 0100

Boolean Logic|🔝|#

logicjava booleanjava bitwisecircuit design
NOT¬x\neg x!x~xx’
ANDxyx\land yx && yx & yxy
ORxyx\lor yx || yx | yx+y
XORxyx \oplus yx ^ yx ^ yxyx \oplus y

NOT operator truth table|🔝|#

xx’
¬x\neg x
~x
!x
01
10

AND operator truth table|🔝|#

xyx AND y
xy
xyx\land y
x && y
000
010
100
111

OR operator truth table|🔝|#

xyx OR y
x+y
xyx\lor y
x || y
000
011
101
111

XOR operator truth table|🔝|#

xyx XOR y
xyx \oplus y
x ^ y
000
011
101
110

Real kernel example(13번 참고하면됨. | 13. Align to power-of-two boundary)|🔝|#

Linux uses similar macros:

// macro_c_lang.c
#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))
  • Example:
ALIGN(13,8) = 16
ALIGN(16,8) = 16
ALIGN(17,8) = 24
  • Rust version used in allocators
// main.rs
fn align_up(x: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (x + align - 1) & !(align - 1)
}
  • The debug_assert! ensures correctness.
PurposeExpression
align up(x + a - 1) & !(a - 1)
align downx & !(a - 1)
check alignedx & (a - 1) == 0

15 Legendary Bit Hacks|🔝|#

#ExpressionPurpose쓰이는 곳
1x & (x - 1)clear lowest set bitUsed for bit counting loops.
2x & -xextract lowest set bit
3(x & (x - 1)) == 0check power of twokernel page sizes
memory alignment
ring buffers
4x | (x - 1)set bits below lowest set bit
5x & (x + 1)clear trailing ones
6~x & (x + 1)isolate lowest zero bit
7x ^ (x >> 1)binary → Gray coderotary encoders
hardware counters
error correction
8x ^ (x >> 1) ^ (x >> 2) ...Gray → binaryUsed in hardware decoding.
9x >> n & 1test nth bitCPU flags
protocol parsers
10x ^ (1 << n)toggle bit
11x & ~(1 << n)clear bit
12x | (1 << n)set bit
13(x + (1 << n)) & ~((1 << n) - 1)align to power-of-two boundarymemory allocators
page alignment
kernel slabs
14(x + y) ^ ((x ^ y) & -(x < y))branchless min/maxAdvantages: avoids branch misprediction
useful in SIMD
15x -= ((x >> 1) & 0x55555555)fast bit count (popcount step)Part of the SWAR popcount algorithm.
Counts bits in parallel inside a register.

Why programmers group bits in 4|🔝|#

  • Because 1 hex digit = 4 bits

  • Example:

0000 0100  = 0x04 = 4(decimal)
0010 1100  = 0x2C = 44(decimal)
1111 1111  = 0xFF = 255(decimal)
  • So this format makes debugging:
    • CPU registers
    • kernel bitmasks
    • network packets
    • SIMD values
  • much easier.

1. Clear lowest set bit|🔝|#

x & (x - 1)
  • Example
x      = 101100    # 0x2C    44(10dec)
x - 1  = 101011    # 0x2B    43(10dec)
result = 101000    # 0x14    20(10dec)
  • Used for bit counting loops.

2. Extract lowest set bit|🔝|#

x & -x
  • Example
101100
000100
  • Used in

    • Fenwick trees
    • memory allocators
    • bitboards
  • Rust연습

// main.rs

fn main() {
    let x = 44;        // 0010 1100    // 0x2C
                      //  1101 0100  -x = -44 
    let res = x & -x;
    let res02 = x & !x;
    let res03 = x | !x;
    let minus_x = -x;  // 1101 0100    // -44(10dec)
    let nagative_x = !x;
    println!("res : {res}"); // 0000 0100
    println!("res02 : {res02}");
    println!("res03 : {res03}");
    println!("minus_x : {minus_x}");
    println!("minus_x : {:08b}", minus_x as u8);
    println!("nagative_x : {nagative_x}");
    println!("4(dec) : 0b{:08b}", res as u8); // 0000 0100
    println!("-x(44dec) binary : 0b{:08b}", minus_x as u8);
}
  • result
res : 4
res02 : 0
res03 : -1
minus_x : -44
minus_x : 11010100
nagative_x : -45
4(dec) : 0b00000100
-x(44dec) binary : 0b11010100

3. Check power of two|🔝|#

(x & (x - 1)) == 0
  • Example
100000
&
011111
= 0
  • Used everywhere:

    • kernel page sizes
    • memory alignment
    • ring buffers
  • rust코드로 연습

// main.rs

fn main() {
    let x = 47; // 0010 1111
    //&(x-1) dec 46  // 0010 1110
    //       dec 46  // 0010 1110
    let res = (x & (x - 1)) == 0;
    println!("{res}");
}
  • result
false

4. Set bits below lowest set bit|🔝|#

x | (x - 1)

Example

101100

101111
  • Rust code로 연습
// main.rs

fn main() {
    let x = 44; // 0010 1100
// (x-1) dec 43 // 0010 1011
//              // 0010 1111
    let res = x | (x - 1);
    println!("{res}");
}
  • result
47   // 0010 1111

5. Clear trailing ones |🔝|#

x & (x + 1)
  • Example
101111

100000
  • rust로 연습
// main.rs

fn main() {
    let x = 47; //      0010 1111
    //&(x+1) dec 48  // 0011 0000
    //       dec 32  // 0010 0000
    let res = x & (x + 1);
    println!("{res}");
}
  • result
32 # 0010 0000

6. Find lowest zero bit |🔝|#

//main.c
~x & (x + 1)
  • Useful for bit allocation algorithms.

  • rust코드로 연습

// main.rs

fn main() {
    let x = 47; // 0010 1111
    // !x            // 1101 0000
    //&(x+1) dec 48  // 0011 0000
    //       dec 16  // 0001 0000
    let res = !x & (x + 1);
    println!("{res}");
}
  • result
16  # 0001 0000

7. Binary → Gray code |🔝|#

// main.c
gray = x ^ (x >> 1)
  • Example
binary:  1011
gray:    1110
  • rust code로 연습
// main.rs

fn bin4(x: u8) -> String {
    format!("{:04b}", x)
}

fn main() {
    let binary = 11; // 1011
    //        x >> 1 // 0101
    //   x ^ (x >>1) // 1110
    let gray = binary ^ (binary >> 1);

    println!("x    : {}", bin4(binary));
    println!("gray : {}", bin4(gray));
}
  • result
binary:  1011
gray:    1110

쓰이는곳#

  • Property:
  • Only one bit changes between consecutive numbers, which is why Gray code is used in:
    • rotary encoders
    • Karnaugh maps
    • digital circuits
    • error-resistant counters

8. Gray → Binary|🔝|#

// main.c
b = g
b ^= b >> 1
b ^= b >> 2
b ^= b >> 4
b ^= b >> 8
b ^= b >> 16

Used in hardware decoding.

  • Rust Code로 연습
// main.rs

fn gray_to_binary(mut b: u32) -> u32 {
    b ^= b >> 1;
    b ^= b >> 2;
    b ^= b >> 4;
    b ^= b >> 8;
    b ^= b >> 16;
    b
}

fn main() {
    let gray: u32 = 0b1110; // Gray code
    let binary = gray_to_binary(gray);

    println!("gray   : {:04b}", gray);
    println!("binary : {:04b}", binary);
}
  • result
gray   : 1110
binary : 1011
  • 똑같은 기능(이게 더 눈에 들어온다.ㅋ)
// main.rs

fn gray_to_binary(mut x: u32) -> u32 {
    let mut shift = 1;
    while shift < 32 {
        x ^= x >> shift;
        shift <<= 1;
    }
    x
}

9. Test bit |🔝|#

// main.c
(x >> n) & 1
  • Example
x = 101101
n = 2
 1
  • Used in:

    • CPU flags
    • protocol parsers
  • Rust code로 연습

// main.rs

fn main() {
    let n = 2; //  0000 0010
    let x = 45; // 0010 1101
    // 1             // 0000 0001
    //&(x >> 2)dec11 // 0000 1011
    //       dec 1   // 0000 0001
    let res = (x >> n) & 1;
    println!("{res}");
}
  • result
1

10. Toggle bit|🔝|#

// main.c
x ^ (1 << n)
101100
toggle bit2
 101000
// main.rs

fn main() {
    let n = 2; //  0000 0010
    let x = 44; // 0010 1100
//   let x: u8 = 0b101100; 44dec
    // 1             // 0000 0001
    // XOR ^
    //^(1 >> 2)dec11 // 0000 0100
    //       dec 1   // 0000 0001
    // res      40   // 0010 1000
    let res = x ^ (1 << n);
    println!("{res}");
}
  • result
40

11. Clear bit|🔝|#

// main.c
x & ~(1 << n)
  • Rust Code로 연습
// main.rs

fn main() {
    let negative_x = !(1 << 2);
    let n = 2; //   0000 0010
    let x = 45; //  0010 1101
    //  1            //  0000 0001
    // (1 << 2) 4dec //  0000 0100
    //\!(1 << 2)-5dec//  1111 1101
    //&!(1 << 2)     //  0000 1011
    //  res  41 dec  //  0010 1001
    let res = x & !(1 << n);
    println!("{negative_x}");
    println!("negative x binariy : {:8b}", negative_x as u8);
    println!("{res}");
}
  • result
-5
negative x binari : 11111011
41

12. Set bit |🔝|#

// main.c
x | (1 << n)
  • Example
101000
set bit2
 101100
// main.rs

fn main() {
    //         1dec //   0000 0001
    let n = 2; //   0000 0010
    let x = 45; //  0010 1101
    // (1 << 2) 4dec //  0000 0100
    //| (1 << 2)     //
    //  res  45 dec  //  0010 1101
    let res = x | (1 << n);
    println!("{res}");
}
  • result
45

13. Align to power-of-two boundary|🔝|#

// main.c 
(x + (a - 1)) & ~(a - 1)
  • is a very famous systems-programming trick used to align a number x up to the nearest multiple of a, where a is typically a power of two.

  • Example align to 16

x = 37
 48
  • Used in:

    • memory allocators
    • page alignment
    • kernel slabs
  • This appears in:

    • OS kernels
    • memory allocators
    • page alignment
    • SIMD buffers
  • Rust Code로 연습하기

// (a - 1) // 7  , a = 8
// 8   0000 1000
// 7   0000 0111
// 13  0000 1101
// +   0001 0100  (20dec)

//                       0001 0100  (20dec)
// !(a - 1) // 248u8   0b1111 1000
//                       0001 0000  (16dec)  align 16

fn align_up(x: usize, a: usize) -> usize {
    (x + (a - 1)) & !(a - 1)
}

fn main() {
    let x = 13;
    let a = 8;

    let aligned = align_up(x, a);

    let negative_a_minus = !(a - 1);

    println!("x       : {}", x);
    println!("aligned : {}", aligned);

    println!(" ~~~~");
    println!("negative!(8 -1) : {}", negative_a_minus as u8);
    println!("negative!(8 -1) : {:08b}", negative_a_minus as u8);
}
  • result
x       : 13
aligned : 16
 ~~~~
negative!(8 -1) : 248
negative!(8 -1) : 11111000

14. Branchless min/max|🔝|#

// main.c

min = y ^ ((x ^ y) & -(x < y))
  • Advantages:

    • avoids branch misprediction
    • useful in SIMD
  • Rust code로 연습하기

  • MIN

fn branchless_min(x: i32, y: i32) -> i32 {
    y ^ ((x ^ y) & -((x < y) as i32))
}
  • MAX
fn branchless_max(x: i32, y: i32) -> i32 {
    x ^ ((x ^ y) & -((x < y) as i32))
}

MIN#

// 10   // 0000 1010
// 6    // 0000 0110
// x ^ y // 0000 1100    // 10 dec

fn branchless_min(x: i32, y: i32) -> i32 {
    y ^ ((x ^ y) & -((x < y) as i32))
}

fn main() {
    let x = 10;
    let y = 6;

    let min = branchless_min(x, y);

    println!("x   : {}", x);
    println!("y   : {}", y);
    println!("min : {}", min);

    let test_eval = x < y as i32;
    println!("10 < 6 as i32 : {}", test_eval);
    println!("-(10 < 6 as i32) : {}", -(test_eval as i32));
    // println!("10 < 6 as i32 : {:08b}", test_eval);
}
  • result(MIN)
x   : 10
y   : 6
min : 6
10 < 6 as i32 : false
-(10 < 6 as i32) : 0

MAX#

// 10   // 0000 1010
// 6    // 0000 0110
// x ^ y // 0000 1100    // 10 dec

fn branchless_max(x: i32, y: i32) -> i32 {
    x ^ ((x ^ y) & -((x < y) as i32))
}

fn main() {
    let x = 10;
    let y = 6;

    let max = branchless_max(x, y);

    println!("x   : {}", x);
    println!("y   : {}", y);
    println!("max : {}", max);

    let test_eval = x < y as i32;
    println!("10 < 6 as i32 : {}", test_eval);
    println!("-(10 < 6 as i32) : {}", -(test_eval as i32));
    // println!("10 < 6 as i32 : {:08b}", test_eval);
}

- result(MAX)

```bash
x   : 10
y   : 6
max : 10
10 < 6 as i32 : false
-(10 < 6 as i32) : 0

Why this trick exists#

  • This appears in:

    • CPU micro-optimizations
    • vectorized math libraries
    • GPU shaders
    • compilers
    • cryptography
  • because branches can cause pipeline stalls.

  • Modern compilers often convert

// max
x.max(y);

// min
x.min(y);
  • into branchless instructions automatically.

15. Fast popcount step |🔝|#

// main.c

x -= (x >> 1) & 0x55555555
  • is the first step of the SWAR popcount algorithm

  • Counts bits in parallel inside a register.

    • (SIMD Within A Register) used to count bits in parallel.

What 0x55555555 means#

0x55555555
= 01010101 01010101 01010101 01010101
  • It selects alternating bits.
// 214_dec                                    1101 0110
// 0x5555_5555  (1,431,655,765_dec)
// 0x5555_5555  0101 0101 0101 0101 0101 0101 0101 0101
// (x >> 1)                                   0110 1011
// &         65_dec                           0100 0001
//
// result (149_dec)
//              0000_0000_0000_0000_0000_0000_1001_0101
fn main() {
    let mut x: u32 = 0b1101_0110;

    x -= (x >> 1) & 0x5555_5555;

    println!("result: {:032b}", x);
}
  • result
result: 0000 0000 0000 0000 0000 0000 1001 0101

Base64기초|🔝|#

  • Rust Code로 해보기
fn main() {
    let input = "test";
    let bytes = input.as_bytes();

    let i: usize = 0;

    let b0 = bytes[i];
    let b1 = if i + 1 < bytes.len() { bytes[i + 1] } else { 0 };
    let b2 = if i + 2 < bytes.len() { bytes[i + 2] } else { 0 };

    // Combine into 24 bits
    let triple: u32 =
        ((b0 as u32) << 16) |
        ((b1 as u32) << 8)  |
        (b2 as u32);

    println!("{input}");
    println!("{b0}");
    println!("{b1}");
    println!("{triple}");
    println!("{:b}", triple);
}
  • Why input[i] fails

    • Rust strings are UTF-8, not byte arrays.
  • So you must do:

input.as_bytes()[i]
  • You are building a 24-bit integer from 3 bytes using bit operations.
  • The operator | is bitwise OR,
    • and << is left shift.
  • Let’s go step-by-step with your example:
input = "test"
bytes = [116, 101, 115, 116]
  • ASCII values:
t = 116 = 0b0111 0100
e = 101 = 0b0110 0101
s = 115 = 0b0111 0011
  • So
b0 = 116
b1 = 101
b2 = 115

  • Step 1 — Shift b0 left 16 bits
(b0 as u32) << 16
01110100 << 16

= 01110100 00000000 00000000
  • Binary (24 bits):
01110100 00000000 00000000
  • Decimal:
116 << 16 = 7602176

  • Step 2 — Shift b1 left 8 bits
(b1 as u32) << 8
01100101 << 8

= 00000000 01100101 00000000
  • Binary:
00000000 01100101 00000000
  • Decimal:
101 << 8 = 25856

  • Step 3 — b2 stays in lowest 8 bits
(b2 as u32)
00000000 00000000 01110011
115

  • Step 4 — Combine using | (bitwise OR)
((b0 << 16) | (b1 << 8) | b2)
  • Bitwise OR combines bits:
01110100 00000000 00000000
00000000 01100101 00000000
00000000 00000000 01110011
--------------------------------
01110100 01100101 01110011
  • Final result:
0111 0100 0110 0101 0111 0011
  • Decimal:
7602176
+ 25856
+   115
---------
7628147

  • Step 5 — Why this is done

  • This packs 3 bytes into one 24-bit number.

  • Layout:

[ b0 ][ b1 ][ b2 ]
 8bit 8bit 8bit
b0 -> high byte
b1 -> middle byte
b2 -> low byte
  • This is used in:
    • Base64 encoding
    • Network protocols
    • Image formats
    • Compression
    • Crypto
    • Binary parsers

  • Step 6 — Why | works

  • Bitwise OR:

0 | 0 = 0
1 | 0 = 1
0 | 1 = 1
1 | 1 = 1
  • Since each byte is shifted to a different region, OR just merges them.
AAAA0000
0000BBBB
000000CC
---------
AAAABBBBCC

  • Step 7 — This is exactly Base64 step

  • Base64 takes:

3 bytes 24 bits split into 4 × 6 bits
  • You just finished the first step of Base64 encoder.


SWAR popcount algorithm (the famous 5-mask method)|🔝|#

  • one of the coolest tricks in low-level programming.

  • The SWAR popcount (5-mask method) counts bits in parallel inside one register.

    • It is called SIMD Within A Register because each step sums bits in groups:
1-bit → 2-bit → 4-bit → 8-bit → 16-bit → 32-bit
  • This is the famous algorithm used in old compilers, kernels, and graphics code before POPCNT.

Full SWAR popcount code (reference)#

// main.c

x -= (x >> 1) & 0x55555555;
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x += x >> 8;
x += x >> 16;
x &= 0x3F;
  • rust
// main.rs

fn popcount(mut x: u32) -> u32 {
    x -= (x >> 1) & 0x5555_5555;
    x = (x & 0x3333_3333) + ((x >> 2) & 0x3333_3333);
    x = (x + (x >> 4)) & 0x0F0F_0F0F;
    x += x >> 8;
    x += x >> 16;
    x & 0x3F
}

Example value#

x = 11010110  (5 bits set)
  • We will track how the register changes.

Step 1 — count per 2 bits#

mask = 0x55555555
= 01010101 01010101 01010101 01010101
x        11010110
x>>1     01101011
mask     01010101
&        01000001
----------------
result   10010101
  • Interpret as pairs:
10 01 01 01
  • Each pair = number of 1s in original pair.
2  1  1  1

Step 2 — count per 4 bits#

mask = 0x33333333
= 00110011 00110011
x = 10010101

x & mask         00010001
(x>>2) & mask    00100001
-------------------------
sum              00110010
  • Now per 4 bits:
0011 0010
  • Meaning:
3   2

Step 3 — count per 8 bits#

mask = 0x0F0F0F0F
= 00001111 00001111
x = 00110010

x + (x>>4)

00110010
00000011
---------
00110101

& mask

00000101
  • Now:
00000101
  • Meaning:
5
  • We already have the popcount in 8 bits.

Step 4 — sum 8-bit groups#

x += x >> 8
  • For 32-bit numbers this merges bytes.

  • Visual:

00000000 00000000 00000000 00000101
  • No change here because small number.

Step 5 — sum 16-bit groups#

x += x >> 16
  • Still:
00000101

Final mask#

x & 0x3F
  • Why?
    • Max popcount of 32-bit = 32
32 = 100000
  • Needs 6 bits → mask 0x3F
00111111

Final result#

11010110 5

Why this is called SWAR|🔝|#

  • Because operations happen in parallel:
StepGroup size
12 bits
24 bits
38 bits
416 bits
532 bits
  • This is like SIMD but inside one register.

Why this was famous(SWAR)|🔝|#

  • Used in:
    • old GCC / Clang
    • Linux kernel
    • Quake engine
    • graphics pipelines
    • cryptography
    • chess engines
    • Before CPUs had:
POPCNT
  • instruction.

Modern Rust#

  • Today you just write:
x.count_ones()
  • But internally the compiler may still use tricks like this on old CPUs.
260312_bit_pattern_training_BitHacks_SWAR
https://younghakim7.github.io/blog/posts/260312_bit_pattern_training_bithacks_swar/
Author
YoungHa
Published at
2026-03-12