Linux on GPU

Ehh.. what?

Who am I

  • Joined NVIDIA in 2010
  • PhysX/APEX
  • NVIDIA Ansel
  • NVIDIA Omniverse

Past experiments

Gameboy on GPU

So what is this about?

Is it possible to run something more complex than Gameboy on a GPU?

Something like.. Linux?

Emulator 101

CPU (semu)

RISC-V RV32IMA

RAM

I/O

Emulator 101

CPU on CPU

Emulator 101

CPU on GPU

Why do we emulate only a batch of instructions?

WDDM vs TCC driver mode

WDDM: 2 second CUDA program runtime limit
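
The batching idea can be sketched in a few lines. The toy below is plain Python rather than CUDA, and `ToyCpu`, `run_batch`, and the batch size are hypothetical names and values chosen for illustration. Each call to `run_batch` stands in for one kernel launch that emulates a bounded number of instructions and then returns, so no single launch runs long enough to hit a runtime limit.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    """Hypothetical stand-in for emulated CPU state (not semu's real layout)."""
    pc: int = 0
    retired: int = 0       # instructions executed so far
    halted: bool = False


def step(cpu: ToyCpu, program_len: int) -> None:
    """Pretend to execute one 4-byte instruction; halt at the program end."""
    cpu.pc += 4
    cpu.retired += 1
    if cpu.pc >= program_len * 4:
        cpu.halted = True


def run_batch(cpu: ToyCpu, program_len: int, batch_size: int) -> int:
    """Stand-in for one kernel launch: run at most batch_size instructions."""
    executed = 0
    while executed < batch_size and not cpu.halted:
        step(cpu, program_len)
        executed += 1
    return executed


def run_to_completion(program_len: int, batch_size: int):
    """Host loop: relaunch bounded batches until the guest halts."""
    cpu = ToyCpu()
    launches = 0
    while not cpu.halted:
        run_batch(cpu, program_len, batch_size)
        launches += 1
    return cpu, launches


if __name__ == "__main__":
    cpu, launches = run_to_completion(program_len=1000, batch_size=256)
    print(cpu.retired, launches)
```

The CPU state lives outside the batch function, so each launch resumes exactly where the previous one stopped.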

Emulator 101

CPU on GPU

Emulating 320 instances

320 warps, 1 thread each vs. 10 warps, 32 threads each
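
Both layouts cover the same 320 instances; what differs is how many instances share a warp. A small sketch of the index arithmetic, assuming the usual CUDA warp size of 32; `placement` and `warps_needed` are hypothetical helpers, and the sketch says nothing about which layout is faster.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def placement(instance: int, threads_used_per_warp: int):
    """Return (warp index, lane index) for an emulated instance."""
    if not 1 <= threads_used_per_warp <= WARP_SIZE:
        raise ValueError("threads_used_per_warp must be in 1..32")
    if instance < 0:
        raise ValueError("instance must be non-negative")
    return divmod(instance, threads_used_per_warp)


def warps_needed(instances: int, threads_used_per_warp: int) -> int:
    """Number of warps required to host all instances (ceiling division)."""
    return -(-instances // threads_used_per_warp)


if __name__ == "__main__":
    for used in (1, WARP_SIZE):
        print(used, warps_needed(320, used), placement(319, used))
```

Threads in one warp execute in lockstep, so guests packed into the same warp serialize whenever they take different code paths; with one thread per warp each guest has a warp to itself, at the cost of leaving 31 lanes idle.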

Emulator 101

RAM

What’s dtb?

A DTB (Device Tree Blob) is the compiled form of a Device Tree Source, which defines the machine: it describes the CPU, RAM, initrd, I/O, interrupt controllers, etc.

      /dts-v1/;

      / {
          model = "semu";

          aliases {
              serial0 = "/soc/serial@4000000";
          };

          chosen {
              bootargs = "earlycon console=ttyS0";
              stdout-path = "serial0";
              linux,initrd-start = <0x1f700000>; /* @503 MiB (503 * 1024 * 1024) */
              linux,initrd-end = <0x1fefffff>;   /* @511 MiB (511 * 1024 * 1024 - 1) */
          };

          cpus {
              timebase-frequency = <65000000>;
              cpu0: cpu@0 {
                  device_type = "cpu";
                  compatible = "riscv";
                  reg = <0>;
                  riscv,isa = "rv32ima";
                  mmu-type = "riscv,rv32";
                  cpu0_intc: interrupt-controller {
                      interrupt-controller;
                      compatible = "riscv,cpu-intc";
                  };
              };
          };

          sram: memory@0 {
              device_type = "memory";
              reg = <0x00000000 0x20000000>;
              reg-names = "sram0";
          };

          soc {
              compatible = "simple-bus";
              ranges = <0x0 0xF0000000 0x10000000>;
              interrupt-parent = <&plic0>;

              plic0: interrupt-controller@0 {
                  compatible = "sifive,plic-1.0.0";
                  reg = <0x0000000 0x4000000>;
                  interrupt-controller;
                  interrupts-extended = <&cpu0_intc 9>;
                  riscv,ndev = <31>;
              };

              serial@4000000 {
                  compatible = "ns16550";
                  reg = <0x4000000 0x100000>;
                  interrupts = <1>;
                  no-loopback-test;
                  clock-frequency = <5000000>; /* the baudrate divisor is ignored */
              };

              net0: virtio@4100000 {
                  compatible = "virtio,mmio";
                  reg = <0x4100000 0x100000>;
                  interrupts = <2>;
              };
          };
      };

Emulator 101

RAM

A pointer to the DTB block in RAM is placed in the CPU register A1 before the machine starts
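
A minimal sketch of that hand-off, assuming a register file modeled as a Python list and a hypothetical load address. In the RISC-V ABI, a0 is x10 and a1 is x11; the Linux boot convention on RISC-V also expects the hart ID in a0. `load_blob`, `prepare_boot`, and the `0x100` address are illustrative names and values, not semu's.

```python
A0, A1 = 10, 11  # RISC-V ABI register names: a0 is x10, a1 is x11


def load_blob(ram: bytearray, addr: int, blob: bytes) -> None:
    """Copy a blob (kernel, DTB, initrd) into emulated RAM at addr."""
    if addr < 0 or addr + len(blob) > len(ram):
        raise ValueError("blob does not fit in RAM at this address")
    ram[addr:addr + len(blob)] = blob


def prepare_boot(ram: bytearray, dtb: bytes, dtb_addr: int, hart_id: int = 0):
    """Place the DTB in RAM and return the initial register file."""
    load_blob(ram, dtb_addr, dtb)
    regs = [0] * 32
    regs[A0] = hart_id    # boot hart ID
    regs[A1] = dtb_addr   # pointer to the DTB block in RAM
    return regs


if __name__ == "__main__":
    ram = bytearray(4096)
    fake_dtb = bytes.fromhex("d00dfeed") + bytes(12)  # DTB magic + padding
    regs = prepare_boot(ram, fake_dtb, dtb_addr=0x100)
    print(hex(regs[A1]), ram[0x100:0x104].hex())
```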

Emulator 101

What’s initrd?

Root filesystem in memory to bootstrap the system

It is typically used to load drivers needed during boot, but here it serves as the whole filesystem

Emulator 101

I/O

MMU, PLIC, UART, virtio-net

Linux kernel loads!

64 VMs

(8x play speed)

Now what?

It’s dull without I/O

So let’s make use of the network

Packet rewrite

Host → VM

How do we actually "put the packet into the appropriate VM’s virtio-net RX buffer"?

CUDA Managed Memory to the rescue

Great for lots of small writes
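
With managed memory the same allocation is addressable from both host and device, so the host can write a packet straight into a VM's receive area. The sketch below only models that idea in plain Python: the shared allocation is a `bytearray`, and the slot size, the 4-byte length prefix, and the function names are all hypothetical, not the layout used in the talk or by virtio-net.

```python
import struct

SLOT_SIZE = 2048                  # hypothetical per-VM RX slot size in bytes
LEN_PREFIX = struct.Struct("<I")  # hypothetical 4-byte length header per slot


def make_shared_rx(num_vms: int) -> bytearray:
    """Model one shared allocation holding an RX slot for every VM."""
    return bytearray(num_vms * SLOT_SIZE)


def _slot_base(shared: bytearray, vm_index: int) -> int:
    base = vm_index * SLOT_SIZE
    if vm_index < 0 or base + SLOT_SIZE > len(shared):
        raise IndexError("no such VM slot")
    return base


def host_put_packet(shared: bytearray, vm_index: int, packet: bytes) -> bool:
    """Host side: write a packet into the VM's slot; False if still occupied."""
    if not 0 < len(packet) <= SLOT_SIZE - LEN_PREFIX.size:
        raise ValueError("packet is empty or too large for a slot")
    base = _slot_base(shared, vm_index)
    (pending,) = LEN_PREFIX.unpack_from(shared, base)
    if pending:
        return False
    start = base + LEN_PREFIX.size
    shared[start:start + len(packet)] = packet
    LEN_PREFIX.pack_into(shared, base, len(packet))  # publish the length last
    return True


def vm_take_packet(shared: bytearray, vm_index: int):
    """VM side: return the pending packet and free the slot, or None."""
    base = _slot_base(shared, vm_index)
    (length,) = LEN_PREFIX.unpack_from(shared, base)
    if length == 0:
        return None
    start = base + LEN_PREFIX.size
    data = bytes(shared[start:start + length])
    LEN_PREFIX.pack_into(shared, base, 0)
    return data
```

On real hardware the host and GPU run concurrently, so a scheme like this would also need proper memory ordering between writing the payload and publishing the length; the Python model does not capture that.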

Packet rewrite

A small remark

CUDA Managed Memory to the rescue

Great for lots of small writes

TCC mode

Lets the emulation run completely independently, at its own pace

Packet rewrite

Host → VM

Packet rewrite

VM → Host

Packet rewrite

Host → VM


dduka@dduka-mlt: iptables -A INPUT -p tcp --dport 4000 --tcp-flags RST RST -j DROP
  

Packet rewrite

Host → VM

WinDivert

https://github.com/basil00/WinDivert

Debugging

VM network stack drops inbound packets

Breakpoint in the emulator I/O code

Linux kernel picks up the packet from virtio-net RX buffer

Good

Something is up at L2 (OSI) or higher

Need more tooling

Buildroot

https://buildroot.org/

A helpful tool for building a kernel and a root filesystem

Debugging

Rebuild the initrd with Buildroot to include ethtool

Handy for observing the number of RX/TX packets visible to user space

Packets are visible, but they are still dropped

Debugging

Rebuild the initrd with Buildroot to include tcpdump

Start tcpdump from the /init script, because there is no keyboard input :/

MAC address mismatch!

Fixed the packet rewrite and manually added an ARP entry to match the VM MAC to the VM IP
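
The MAC part of such a rewrite can be sketched as follows. An Ethernet frame starts with the 6-byte destination MAC, followed by the 6-byte source MAC and the 2-byte EtherType; `parse_mac` and `rewrite_dst_mac` are hypothetical helper names, and the sketch leaves everything after the destination address untouched.

```python
def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    parts = text.split(":")
    if len(parts) != 6:
        raise ValueError("MAC address must have 6 octets")
    return bytes(int(part, 16) for part in parts)


def rewrite_dst_mac(frame: bytes, vm_mac: str) -> bytes:
    """Return a copy of the frame with its destination MAC replaced."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    return parse_mac(vm_mac) + frame[6:]


if __name__ == "__main__":
    # Broadcast destination, zeroed source, EtherType 0x0800 (IPv4), no payload.
    frame = b"\xff" * 6 + b"\x00" * 6 + b"\x08\x00"
    print(rewrite_dst_mac(frame, "34:6f:24:4a:95:53").hex())
```

If the destination MAC does not match the interface address, the guest's network stack discards the frame even though the driver received it, which matches the symptom seen with tcpdump.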

Debugging


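  # Bring up eth0 with a fixed IP and MAC address
  printf "Starting network: \n"
  ifconfig eth0 192.168.0.113 netmask 255.255.255.0 \
    hw ether 34:6f:24:4a:95:53 up
  printf "ifconfig info: \n"
  # Pass the output as an argument so it is not parsed as a format string
  printf '%s' "$(ifconfig)"
  printf "\n---------------------------\n"
  # Static ARP entry, so no ARP resolution is needed for this peer
  printf "Updating ARP entry for 192.168.0.112\n"
  arp -s 192.168.0.112 33:6f:24:4a:95:53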
  

Debugging

Packets still do not reach nginx :/

tcpdump: packets are dropped by the network stack because checksums are wrong (byte order)

Fixing packet rewrite (again)
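
For reference, a minimal sketch of the IPv4 header checksum, which is the kind of checksum that has to be recomputed after rewriting addresses. The header is summed as 16-bit big-endian (network order) words with end-around carry, and the result is stored big-endian too; reading or writing the words in little-endian order is an easy way to end up with a checksum the receiver rejects. The function names are illustrative, and the talk does not say which checksum (IP, TCP, or both) was affected.

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words, then inverted."""
    if len(header) % 2:
        raise ValueError("header length must be even")
    total = 0
    for (word,) in struct.iter_unpack("!H", header):  # "!" = network order
        total += word
    while total >> 16:                                # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def with_checksum(header: bytes) -> bytes:
    """Return the header with its checksum field (bytes 10-11) recomputed."""
    if len(header) < 20:
        raise ValueError("IPv4 header is at least 20 bytes")
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    checksum = ipv4_header_checksum(zeroed)
    return header[:10] + struct.pack("!H", checksum) + header[12:]


if __name__ == "__main__":
    # 20-byte header, 192.168.0.1 -> 192.168.0.199, checksum field zeroed.
    raw = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
    fixed = with_checksum(raw)
    print(fixed[10:12].hex(), ipv4_header_checksum(fixed))
```

A header with a correct checksum sums to zero when checked with the same function, which is a convenient self-test after every rewrite.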

Debugging

Packets still do not reach nginx :/

Recompile the kernel (again) with iptables support


  iptables -t raw -A PREROUTING -p tcp -j TRACE
  iptables -t raw -A OUTPUT -p tcp -j TRACE
  

The routing table is empty

Debugging


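  # Bring up eth0 with a fixed IP and MAC address
  printf "Starting network: \n"
  ifconfig eth0 192.168.0.113 netmask 255.255.255.0 \
    hw ether 34:6f:24:4a:95:53 up
  printf "ifconfig info: \n"
  # Pass the output as an argument so it is not parsed as a format string
  printf '%s' "$(ifconfig)"
  printf "\n---------------------------\n"
  # Static ARP entry, so no ARP resolution is needed for this peer
  printf "Updating ARP entry for 192.168.0.112\n"
  arp -s 192.168.0.112 33:6f:24:4a:95:53
  # Without a default route the routing table stays empty
  printf "Adding default route\n"
  route add default gw 192.168.0.1 dev eth0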

Last touches

Create a webpage for the host’s browser

A single page accessing all VMs at the same time

I don’t know JS, but ChatGPT does

VM startup: cron job to update index.html with uptime

VM startup: tune nginx config to return that index.html

The final demo

256 VMs

(8x play speed)

Stats

Each VM is a single RISC-V core with 22 MB of RAM: 9 MB for the kernel and initrd, the rest is heap

RTX 3070 (8 GB) can fit ~600 VMs (initial estimate: ~350)

RTX 4090 (24 GB) can fit ~1800 VMs (initial estimate: ~1100)

VM boot time is ~1.5 minutes when there is no contention for GPU SMs

It can reach ~5 minutes under high contention
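
The capacity figures follow from simple division of GPU memory by per-VM memory. The sketch below computes upper bounds that ignore driver and runtime overhead. The 22 MiB case uses the per-VM size from the slide; the 13 MiB case is an assumption, not something the slides state: it corresponds to keeping a single shared copy of the 9 MB kernel and initrd, and it happens to land near the larger numbers.

```python
def vm_capacity(gpu_mem_gib: int, per_vm_mib: int) -> int:
    """Upper bound on VM count: GPU memory divided by per-VM memory."""
    if gpu_mem_gib <= 0 or per_vm_mib <= 0:
        raise ValueError("sizes must be positive")
    return (gpu_mem_gib * 1024) // per_vm_mib


if __name__ == "__main__":
    for name, gib in (("RTX 3070", 8), ("RTX 4090", 24)):
        full = vm_capacity(gib, 22)        # every VM holds all 22 MiB
        shared = vm_capacity(gib, 22 - 9)  # assumption: kernel + initrd shared
        print(name, full, shared)
```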

Efficiency

Is GPU more efficient?

TLDR: yes, if you need to emulate lots of VMs

Gameboy case, for 16 ms of emulated time:

CPU: 0.8 ms

tens of cores

GPU: 22 ms

thousands of "cores"
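
A back-of-the-envelope sketch of why the slower GPU can still win in aggregate. It reads the timings above as per-instance cost on one core or one GPU thread, and it assumes perfect scaling with one instance per lane. The core counts are hypothetical stand-ins for "tens" and "thousands" (5888 is the CUDA core count of an RTX 3070), so the output is an illustration, not a measurement.

```python
def sustained_instances(lanes: int, emulated_ms: float, wall_ms: float) -> float:
    """Instances that can run at real-time speed across `lanes` parallel lanes.

    Assumes one instance per lane and no contention between lanes.
    """
    if lanes <= 0 or emulated_ms <= 0 or wall_ms <= 0:
        raise ValueError("all arguments must be positive")
    return lanes * (emulated_ms / wall_ms)


CPU_CORES = 16    # hypothetical "tens of cores" host
GPU_LANES = 5888  # hypothetical "thousands of cores" GPU

if __name__ == "__main__":
    cpu = sustained_instances(CPU_CORES, emulated_ms=16, wall_ms=0.8)
    gpu = sustained_instances(GPU_LANES, emulated_ms=16, wall_ms=22)
    print(round(cpu), round(gpu))
```

A single GPU thread is slower than real time here (22 ms of work per 16 ms emulated), so the GPU only pays off when many instances run side by side.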

Questions?