11 Jan 2020
A core feature of capnproto-rust is its ability to
read messages directly from memory without copying the data into auxiliary structures.
Unfortunately, this functionality is a bit tricky to use correctly,
as can be seen in its primary interface, the
read_message_from_words() function, whose input is of type &[Word].
In the common case where you want to read from a &[u8], you must first call the unsafe function bytes_to_words() in order to get a &[Word].
It is only safe to call this function if you know that your data is
8-byte aligned or if you know that your code will only run on processors
that permit unaligned memory access. (EDIT: ralfj informs me that misaligned loads are never okay.)
The former condition can be difficult to meet, especially if your memory comes from
an external library like sqlite or zmq where no alignment guarantees are given,
and the latter condition feels like an unfair burden, both in terms of demanding that
you understand a rather subtle concept, and in terms of limiting where your software can run.
So it’s easy to understand why someone might shy away from calling bytes_to_words() and, in turn, read_message_from_words().
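To make the alignment hazard concrete, here is a minimal sketch (not taken from capnproto-rust itself; the Word definition and the helper name try_bytes_to_words() are stand-ins) of the kind of checked wrapper a caller would want before performing the cast:
/// An 8-byte-aligned stand-in for capnproto-rust's Word type.
#[repr(C, align(8))]
#[derive(Clone, Copy)]
pub struct Word(pub [u8; 8]);

/// Returns Some only when the cast is actually sound: the buffer must start
/// on an 8-byte boundary and contain a whole number of words.
pub fn try_bytes_to_words(bytes: &[u8]) -> Option<&[Word]> {
    let misaligned = bytes.as_ptr() as usize % core::mem::align_of::<Word>() != 0;
    let ragged = bytes.len() % core::mem::size_of::<Word>() != 0;
    if misaligned || ragged {
        return None;
    }
    // Safety: alignment and length were checked above, and Word is valid for
    // any bit pattern.
    unsafe {
        Some(core::slice::from_raw_parts(
            bytes.as_ptr() as *const Word,
            bytes.len() / core::mem::size_of::<Word>(),
        ))
    }
}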
Can we do better? Ideally, capnproto-rust would safely operate directly on input of type &[u8].
We can in fact adapt the code to do that, but it comes at a cost: processors that don’t natively
support unaligned access will need to do some more work every time that capnproto-rust
loads or stores a multi-byte value.
To get some idea of what that extra work looks like, let’s examine
the assembly code emitted by rustc!
(A better way to quantify the cost would be to perform controlled experiments on actual hardware,
but that’s a more involved project than I’d like to tackle right now.)
Below is some code representing a bare-bones simplification of the two approaches to memory access.
(The #![no_std] and #[no_mangle] attributes are to simplify the assembly code.)
#![no_std]

#[no_mangle]
pub unsafe fn direct_load(x: &[u8; 8]) -> u64 {
    // Cast the byte pointer to *const u64 and load through it directly;
    // this is only sound when the input is 8-byte aligned.
    (*(x.as_ptr() as *const u64)).to_le()
}

#[no_mangle]
pub fn indirect_load(x: &[u8; 8]) -> u64 {
    // Assemble the u64 from its bytes; no alignment assumption is made.
    u64::from_le_bytes(*x)
}
The direct_load() function represents the current state of affairs in capnproto-rust. It loads a u64 by casting a pointer of type *const u8 to type *const u64 and then dereferencing that pointer. This is only safe if the input is 8-byte aligned or if the processor can handle unaligned access. (EDIT: again, see ralfj’s reddit comment.)
The indirect_load() function represents the safer alternative. We expect this to sometimes require more work than direct_load(), but it has the advantage of being easier to use and understand.
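As a quick sanity check, and assuming direct_load() and indirect_load() are pulled into an ordinary (non-no_std) binary, the two functions should agree whenever the input happens to be 8-byte aligned. Here is a hedged sketch that forces alignment by deriving the bytes from a u64:
fn main() {
    let value: u64 = 0x0807_0605_0403_0201;
    // Store the value in little-endian order so the in-memory bytes are
    // 0x01, 0x02, ..., 0x08 regardless of the host's endianness.
    let storage: [u64; 1] = [value.to_le()];
    // The u64 storage guarantees 8-byte alignment, so both the [u8; 8] view
    // and the u64 load inside direct_load() are sound here.
    let bytes: &[u8; 8] = unsafe { &*(storage.as_ptr() as *const [u8; 8]) };
    assert_eq!(unsafe { direct_load(bytes) }, value);
    assert_eq!(indirect_load(bytes), value);
}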
To compare the assembly code generated by these functions, I installed a variety of rustc targets using rustup:
rustup target add $TARGET
and then for each target compiled the code with:
rustc -O --crate-type=lib test.rs --target=$TARGET --emit=asm
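For instance, with the 64-bit ARM Linux target (just one plausible choice; the post does not list the exact target triples it used), the two steps would be:
rustup target add aarch64-unknown-linux-gnu
rustc -O --crate-type=lib test.rs --target=aarch64-unknown-linux-gnu --emit=asm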
The results, edited to only include the relevant bits of code, are shown below, labeled by architecture.
x86-64:
direct_load:
movq (%rdi), %rax
retq
indirect_load:
movq (%rdi), %rax
retq
x86 (32-bit):
direct_load:
movl 4(%esp), %ecx
movl (%ecx), %eax
movl 4(%ecx), %edx
retl
indirect_load:
movl 4(%esp), %ecx
movl (%ecx), %eax
movl 4(%ecx), %edx
retl
AArch64 (64-bit ARM):
direct_load:
ldr x0, [x0]
ret
indirect_load:
ldr x0, [x0]
ret
WebAssembly:
direct_load:
local.get 0
i64.load 0
indirect_load:
local.get 0
i64.load 0:p2align=0
ARM (32-bit):
direct_load:
ldrd r0, r1, [r0]
bx lr
indirect_load:
ldr r2, [r0]
ldr r1, [r0, #4]
mov r0, r2
bx lr
PowerPC:
direct_load:
li 4, 4
lwbrx 5, 3, 4
lwbrx 4, 0, 3
mr 3, 5
blr
indirect_load:
li 4, 4
lwbrx 5, 3, 4
lwbrx 4, 0, 3
mr 3, 5
blr
MIPS:
direct_load:
lw $1, 4($4)
wsbh $1, $1
rotr $2, $1, 16
lw $1, 0($4)
wsbh $1, $1
jr $ra
rotr $3, $1, 16
indirect_load:
lwl $1, 4($4)
lwr $1, 7($4)
wsbh $1, $1
rotr $2, $1, 16
lwl $1, 0($4)
lwr $1, 3($4)
wsbh $1, $1
jr $ra
rotr $3, $1, 16
RISC-V (32-bit):
direct_load:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
addi s0, sp, 16
lw a2, 0(a0)
lw a1, 4(a0)
mv a0, a2
lw s0, 8(sp)
lw ra, 12(sp)
addi sp, sp, 16
ret
indirect_load:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
addi s0, sp, 16
lbu a1, 1(a0)
slli a1, a1, 8
lbu a2, 0(a0)
or a1, a1, a2
lbu a2, 3(a0)
slli a2, a2, 8
lbu a3, 2(a0)
or a2, a2, a3
slli a2, a2, 16
or a2, a2, a1
lbu a1, 5(a0)
slli a1, a1, 8
lbu a3, 4(a0)
or a1, a1, a3
lbu a3, 6(a0)
lbu a0, 7(a0)
slli a0, a0, 8
or a0, a0, a3
slli a0, a0, 16
or a1, a0, a1
mv a0, a2
lw s0, 8(sp)
lw ra, 12(sp)
addi sp, sp, 16
ret
As expected, direct_load()
and indirect_load()
generate the same
assembly code for many targets. These are presumably exactly the targets that support
unaligned memory access. On targets where different instructions were generated
for the two functions, indirect_load() typically requires somewhere between 2x and 3x as many instructions as direct_load(). Is that an acceptable cost? How much of an
impact would it have in the context of a complete real-world program? I don’t know!
I’m inclined to believe that the usability benefits of the
indirect_load()
approach outweigh its performance
cost, especially since that cost is probably zero or negligible on the most commonly used targets,
but maybe that’s not true?
I encourage any readers of this post who have thoughts on the matter to comment
on this github issue.