The breakthrough was achieved by implementing numerous fine-grained optimizations and by programming in assembly-like PTX (Parallel Thread Execution) instead of Nvidia's CUDA, according to an ...