add arm64 implementation; silently ignore bad clock runs
add 386 support (thanks foura and Amavect)
change read/write sizes
change benchmarks, grouping
clean up assembly mess
use proper serialization; use RDTSCP when available
add more sleep examples
better formatting; better percentiles
wire to cpu 0, remove memmove benchmarks
fix op/s calculation