forked from deater/ll_asm
linux_logo in 26+ kinds of assembly language
License
BigEd/ll_asm
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
yes, I am insane Tired of waiting many miliseconds for linux_logo to run? Tired of wasting 35k of disk space? Upset that to run linux_logo you need huge GLIBC? Well your worries are over! With "ll" [even the name was shortened to save space!] you get all of the benefits of linux_logo in a smaller, faster package! "ll" is written entirely in native Linux assembly language! Some Statistics --------------- *NOTE* not all architectures implement the same feature-set (IE, not all have MHz in /proc/cpuinfo) so this is only a rough comparison. Processor lzss executable --------- --------------- ia64 2826 bytes (as 2.18.0.20080103) alpha 1821 bytes (as 2.19.51.20090723) RiSC 1418 bytes (RiSC-as 0.0.2) parisc 1400 bytes (as 2.17.50) SPARC 1397 bytes (as 2.19.51.20090714) microblaze 1298 bytes (as 2.16) mips 1276 bytes (as 2.19.51.20090714) m88k 1240 bytes (as 1.92.3) arm.oabi 1186 bytes (as 2.18.0.20080103) PPC 1165 bytes (as 2.19.51.20090805) arm.eabi 1161 bytes (as 2.24.51.20141001) 6502 1130*bytes (ca64 2.12.0) arm64 1094 bytes (as 2.23.1) s390 1064 bytes (as 2.19.51.20090805) x86_64 1030 bytes (as 2.23.52.20130727) x86_x32 1006 bytes (as 2.24) 1802 1010 bytes (asmx) sh3 994 bytes (as 2.17.50.0.5) m68k 982 bytes (as 2.18.0.20080103) x86 969 bytes (as 2.19.51.20090723) vax 950 bytes (as 2.16.1) arm_thumb 924 bytes (as 2.24.51.20141001) avr32 914 bytes (as 2.16.1) arm_thumb2 908 bytes (as 2.24.51.20141001) crisv32 905 bytes (as 2.12.1) z80 891 bytes (as 2.20.1.20100303) pdp-11 890 bytes (as 2.19) 8086 780 bytes (as 2.20.1-system.20100303) * the 6502 results were adjusted to match the code present in other architectures (i.e., not counting the graphical routines) The various implementations have varying functunality and often use different methods to get system info. Still, some gross comparisons between the architectures can be made. Individual architectural comments, in descending order of executable size: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ia64: ia64 is VLIW. no divide at all. you use the fp unit to do int multiply no unaligned words. on the plus side, basically have infnite (well, 128) registers does have an auto-incrementing load/store. The actual ia64 architecture is too bizzarre for words. It probably doesn't make any sense at all unless you are a computer architect. The instructions come in groups of 3 (40-bits each, with a total instruction size of 128bits). These run in parallel. So if your instructions don't parallelize, you end up running lots of nops. In any case it's no suprise the executable ends up being so huge. It very probably can be optimized a lot from here. If I really cared I'd turn off the automatic bundling and set all of the instruction bundles by hand. I found a bug in the assembler where it puts two call instructions in the same bundle (which can't work). I wonder if thise means I am one of the few people who writes programs entirely in assembly for ia64.... alpha: Alpha is hurt because it implements some "optional" features, such as MHz (which added a lot of code) and also counting num of cpus, with proper pluralization of Processor. The *big* hurt though is lack of byte-manipulating instructions. The original alpha architecture did not support operations on bytes, you have to do a lot of shifting and masking. Unfortunately ll uses a lot of byte-sized memory operations. On the x86 the instruction "lodsb" is 1 byte in size. On the PPC the equivelant is "lbzu" which is 4 bytes in size. On the alpha, the instruction is "ldb" which expands to a "lda","ldq_u","extbl","sll","sra" sequence, plus an add instruction that x86 and ppc do automatically. Thus, taking 24 bytes. It's a bit better if you do an unsigned load, which is only 16 bytes, but still. A store byte actually does a 32-bit load, masks in the byte by hand, and then does the actual 32-bit store :( The "jump if bit one set" type instructions help the lzss. The immediate field for ALU ops of only 8 bits really hurts There is no native integer division routine on alpha. The GOT section is a huge space-sink. It has 64-bit constants, which most of the time you shouldn't really need. RiSC: This is a 16-bit architecture used in my undergrad computer courses I took at University of Maryland. http://www.ece.umd.edu/~blj/RiSC/ No logic instructions besides nand. This means masking with an "and" takes at least 2 instructions (because there is no immediate form). 16-bit memory acesses only. It is quite complicated to load a byte. Only 7 registers, one of which is always 0, and one of which is the stack pointer. No shift instructions. Left shift can be done with add, but right shift takes a fairly complicated routine. Only able to branch +/- 64 instructions, which can be a limit. Only a 6-bit immediate, though the lui instruction helps. parisc: + really hurt by its short immediate field. Most addresses require 2 instructions to load, even if try to use relative-data add. + no integer divide, have to code it up... not so bad in loop form + delay slot that can be nulled out + some ALU instructions can also null out following instructions conditionally + compare immediate instruction only can handle 5-bit immediate + loads/stores must be aligned + no AND immediate instruction sparc: Condition codes make for tighter code. register/register load address calcs also help. 13-bit immdediate hurts a bit. Crazy register windows a bit hard to understand SPARC is unfair a bit, because my test machine is a 24-proc niagara, so it has extra code to handle that many chips properly. 65c816: Another non-Linux addition. The method of switching the A and X, Y registers into and out of 16 bit mode is a PAIN. A non-intuitive opcode, and if you aren't careful your routines break if you're in an unexpected mode. Would have been much better if somehow there were separate opcodes for 8 vs 16-bit instructions. 6502: So obviously this isn't running on Linux. I was curious how an 8-bit processor would compare. The big problem is that the LZSS algorithm and the ll data set are very much 16-bits in size. So there is a lot of overhead having to increment 16-bit values on an 8-bit processor. Only having 3 registers is a handicap, but the zero page (the first 256 bytes of memory which can be accesses in one less byte and in fewer cycles) act as almost virtual registers. Some potential useful instructions that would have helped (and that are actually implemented in the later 65C02 version of the chip): phx/plx (push/pull X directly) ina/dea (increment/decrement A... otherwise need 2 instructions) bra (branch always, relative jump. otherwise need 3-byte 16-bit) stz (store zero) The 6502 code uses high-res graphics to approximate the color ascii art. When calculating total image size, we don't count the overly-large graphics code and instead only count the code that would be used to print the text, say, out over a serial port to an ANSI capable term client. microblaze: Similar to MIPS. Branch delay slot is optional, which helps. Only register+register and register+immediate address modes. The branch+link instruction returns to the same instruction that left from, so for the return you have to specify to add 4 or 8? Very weird. The signed 16-bit immediate makes it impossible (I think) to load 0xff00 into a register w/o two instructions. Can't get the assembler to do pointer math into an immediate. mips: Recent binutils has made mips come in line with the other architectures. It is the most RISC of the RISC architectures. Thus it ends up having a very non-dense instruction set. On the plus side, it has hardware support for unaligned loads, plus hardware integer divide, which help a lot. m88k: This chip came after m68k and was motorola's first RISC chip. Didn't go well, they moved on to PPC. It's similar to a cross between MIPS and PPC. OpenBSD because there is no Linux port to m88k. Using gxemul. OpenBSD forces syscall error handling by skipping a jump-to-error-handler instruction on return. This makes each syscall in effect take 8 bytes. Branch delay slot is optional (for performance only?) Starts at low address space. This is a huge win as it makes each pointer load 4-bytes instead of 8 assuming we stay smaller than 64k. bcnd instruction saves a cmp usually cmp quite elaborate, setting 16-bits of comparison info in a reg, not in flags. a more powerful version of the equivelant Alpha instruction. arm64: New ISA different from arm32/thumb/thumb2 Fixed-width 32-bit instructions Wider constants help *a lot* Unaligned loads help too New instructions, such as tbz also help The conditional instructions like csel might, but at least in the lzss code doesn't make a difference. The compare/branch/zero instructions also help. Lack of conditional execution hurts What we really need is an "increment multiple" instruction. inc {r1,r2,r4} And also an equivelent of x86 loop No open syscall, have to waste an instruction setting AT_FDCWD and use openat() instead. arm: no integer division routine Really painful to load constants > 8 bits that aren't powers of two or else 8-bit values shifted by power of two. If we had integer divide, saner constant support, and unaligned loads we could probably beat x86 even with 32 bit instructions. OABI smaller than EABI because don't need to load r7 with syscall number. ppc: The PowerPC has very CISC-like opcodes as well. Despite being load-store with 3 operand instructions, you almost wouldn't know it was considered RISC. I also think I could optimize the code a bit more and challenge x86. The big help is auto-incrementing load/store byte instructions. s390: This is the most CISC architecture I've ever seen. If only it had a "load byte" opcode it would definitely beat out x86. I am sure it can be optimized even smaller than x86 by a s390 expert. Being able to do "strcat" in 2 or 3 op-codes and strlen in not more than 5 is a big plus. + fact all opcodes are often 16-bit and often 32-bit is annoying + not having 3-operand opcodes also hurts + crazy CISC operations are amazing, but often don't do what I need + would be nice if offsets could be negative + would be nice if there was a relative branch shorter than 32-bits x86_64: When doing a straight x86 -> x86_64 conversion (which involves making all of the push %e?? instructions into push %r??, as well as jmp *%edx into jmp *%rdx) makes the code 28 bytes longer, due to the "inc" instruction becoming 2 bytes, and extra addr32 prefixes being added to various move instructions. Switching the syscalls to native syscalls is about neutral. You do have to make sure to save %ecx across syscalls then. The sad part is we have 8 extra regs, but can't use any of them because the extra byte prefix is a killer. Also added in a few bytes extra to print the name better (gratuitous spaces on some cpuinfos). Also we have to handle 4GB of RAM so we lose a few bytes for a 64-bit load. x86_x32: This code is more or less the same as x86_64 The x32 changes don't really help us much, as we weren't using 64-bit pointers before, and we try not to use the R8-R15 because they take extra bytes. In fact, almost all of the size saving comes from the fact that a 32-bit ELF header is smaller than a 64-bit one. The move to having all syscalls with bit 30 set hurts us by about 10 bytes or so. 1802: RCA cosmac 1802 This is an interesting architecture. No dedicated PC (you can set any of the 16 index registers to be PC) No stack (though there are auto-inc/dec insns to help). This makes function calls interesting. The traditional way is to have dedicated index registers for each function, then jump to just before the beginning before returning (to "reset" that particular PC). This is only workable if you have < 8 or so functions and no leaf functions. Otherwise you have to emulate a stack with fairly high overhead instruction count wise. Otherwise nothing unusual when optimizing, except for the always troublesome problem of doing 16-bit math when you can only do math in the 8-bit accumulator. vax: vax is crazy CISC. Some of the CISC instructions: + can operate on variable sized bit-fields + an asm instruction that implements switch/case statements + a fp instruction that calculates polynomials + special instructionss for handling queues + various opcodes to accelerate COBOL (edit, etc) + xfc - extended function call, create your own opcodes You can do strlen with essentially one instruction, though it's a long one. vax could easily beat x86 if it had a few one-byte instructions. sh3: auto-increment addressing for loads but not for stores? -> yes, auto-incrememnt/decrement set up for stack accesses so it decrements on store (push) and incs on load (pop) branch delay slots make things difficult could really use a compare-with-zero instruction for reg other than r0 pretty compact code, even with lots of wasted branch delay slots Really wish could put the divide instructions in a loop (like parisc) m68k: is even more CISC than x86 if such a thing is possible (if you don't count Vector instructions). In addition to BCD instructions it also has a wide variety of bit-field manipulation instructions, plus full ALU complement. bizzarrely, m68k assembly is very similar to THUMB. weird having separate address and data registers. can't shift by an immediate more than 8? can't add carry with immediate? have to clear upper parts of words when doing byte math; no equivelant of the mips "lbu" instruction. x86: The x86 code is currently the smallest, mainly because I had a running contest for a while with Stephan Walter until we got it below 1k. It does help that there are a lot of useful 1-byte instructions in the x86 command set, which give it an instant advantage over all of the RISC chips. Lack of alignment makes string manipulating programs (like ll) a lot easier, as you can store 16 and 32 bit values w/o having to worry if the string is properly aligned. arm_thumb: I tried by hardest to beat x86, even though the arm port doesn't have to do things x86 does (SMP support for example). I came close, but not close enough. The lack of an integer divide instruction and the lack of unaligned memory reads killed it. I do like the thumb instruction set, it is in many ways more powerful than x86 while cleaner at the same time. There is a powerful push/pop instructions that can push/pop any combination of registers in 16 bits. The "blx" instruction to branch to a register (even a high one!) is great. I cheated a bit by using the Arm5 instruction subset. The code wouldn't be anywhere near as small if I had to use generic arm4 thumb. arm_thumb2: Is same size as thumb if you make sure to use the narrow forms of all the opcodes. The "cbz" instruction saves space, but oddly only works for forward branches. The "itt" conditional instruction does help, as does "movw" and "movt" to load 32-bit constants. z80: could really use a 16-bit dec that updated flags could also use more useful 16-bit arithmatic insns could use reg+reg addressing mode nice if we could do ALU ops on other than 8-bit accumulator z80 is nice for pascal-type strings, not C-type Need 16-bit shifts. And shifts by more than 1 Need better way to set regs to zero. Some extra bytes taken to handle CR/LF instead of just LF The string copy routines are almost useless, having almost as many bytes to set up as doing things discretely. Really hurting not having just one more register that cat be accessed with one-byte opcodes (or else an IX+REG addressing mode) avr32: They specifically designed the arch to have compact assemley. The "ret" return instruction is the most useful ever. It can handle returning a value, as well as having a special case to return 0 or -1, and also sets the status flags. The one weakness is that almost no instructions can take immediate values. It also has a great "load halfword and swap bytes" which would be great, only it has to be an aligned halfword so we can't use it :( Has the advantage that binaries start at a low address, so the addresses of functions fit in a small number of bits. The new champion for size ;) And there's probably a few bytes lurking that can be removed still. crisv32: mostly 16-bit instructions If constants are > 6bits need longer encoding reg+offset addressing mode only available for acr (r15) register, which increases code size. This is an improvement on original cris where r15 was the PC. VM usage starts at 0 No hw divide instruction Branch delay slot branch instructions always 3bytes. Use jump (register val) instead? pdp-11: Despite wealth of addressing modes, some critical ones are missing, such as being able to add two regs to equal address, or reg and value *before* dereferencing. really could use logical shifts really could use a real AND instruction severe register pressure UGH it's a pain working in octal extensive use of "adb" the a.out executable format makes for small binaries 8086: This port targets a DOS COM file. A COM file basically has zero overhead, which is why things are very small. A COM file only has one 16-bit segment, so pointers are all less than 16-bits which helps size. 8086 support is simpler than the x86 support which has to handle more complicated cpu_info files, plus some things DOS doesn't support like hostname. 8086 was a direct port of the 386 code, with 32-bit registers switched to 16. Some changes had to be made, as the 16-bit instruction set isn't as orthogonal as the 32-bit one so some register combinations (especially in effective address calculation) aren't allowed. Features: -------- + Runs in 4 miliseconds, more than twice as fast as the 10 linux_logo takes on a K6-2+ 450! + Takes up only 969 bytes when super-stripped on x86! Amaze your enemies! Impress your friends! BUGS: ----- No pretty-printing: This means that your computer is reported just as /proc/cpuinfo reports, ugly model-name, off MHz, and all. Possibly kernel-dependent: I only tested this on 2.4 and 2.6 kernels. The sysinfo() syscall changed between 2.2 and 2.4 Custom Logo: ------------ Point the "ANSI_TO_USE" variable in the Makefile to any text or ansi file you want when building. HOW TO HELP: ------------ If you have a Linux box running on an unsupported architecture, offer the author a shell-account so he can create a version for your type of machine! Useful Resources: ----------------- http://deater.net/weave/vmwprod/asm/ll/ll.html http://www.linuxassembly.org http://www.deater.net/weave/vmwprod/asm http://www.deater.net/weave/vmwprod/linux_logo Publications using ll: ---------------------- * V.M. Weaver, S.A. McKee. "Code Density Concerns for New Architectures", 27th IEEE International Conference on Computer Design (ICCD 2009), Lake Tahoe, California, October 2009. * V.M. Weaver, S.A. McKee. "Optimizing for Size: Exploring the Limits of Code Density" (Poster), Architectural Support for Programming Languages and Operating Systems (ASPLOS '09), Washington DC, March 2009. Thanks to: ---------- Shellcoders. You seem to be the only useful resource for linux assembly on the various platforms. Special Thanks to: ------------------ my lovely wife my beautiful daughter my two sons AUTHOR: ------- Vince Weaver <vince _at_ deater.net> http://www.deater.net/weave
About
linux_logo in 26+ kinds of assembly language
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- Assembly 73.9%
- C 18.0%
- Makefile 4.3%
- TeX 2.6%
- Shell 0.9%
- Visual Basic .NET 0.3%