
8l/openvmtil


openVm : Toolkit for Implementing (and exploring) Languages - a bottom-up, optimizing native code vm that is an extensible, concatenative, RPN scripting language

An Exploration of Language Theory - and its Machine Implementation

Imagine a low level, optimizing, virtual machine (like llvm or jvm or clr) that is an extensible scripting language and that is small enough to be easily verified, where even the runtime is a reconfigurable, extensible language. Now minimize that, with a maximally open, extensible design. Or maybe the best ideas from Forth, C, Lisp, Java, Smalltalk, JavaScript, Prolog, Assembler, ... combined as the base (IO monad) for a functional language like Ocaml or Haskell. But basically it is a small collection of small compiler related software tools that are used a lot and keep getting refined with use. A major part of the goal here is to produce a simple, elegant language compiler that can be totally verified or proven correct, so that someone else can pick it up and further refine it, as a kind of meditation/exercise on language and logic and thought and complexity and computation and art and reality and ...

Current focus (to do) : from a working prototype of a basic system ... : minimal bootstrap, self-hosting, patterns/sets, logic, tail call, type checking, gui

Vision <: < forth, c/c++/c#/d, lisp, smalltalk, self, prolog, java, yacc, apl/j, ml, bash > (c, machine code) :=> maru, joy, teyjus, javaScript, ometa, ocaml, haskell (w/debugger) :> : a distilled computational semantics essence as a vm common ground; inclusive, bottom-up (from machine code not assembly) vm that is also a forth-like (rpn) scripting language upon which to build an elegant parser (cf. ometa) for any language or vm (nlp, llvm, ocaml, haskell, jvm, clr, arm, etc), and libraries -> ai/game engine, os, browser, etc., ... ; sufficiently fine grain semantic primitives that can be combined to form the keywords and semantics of any high level language; the keywords and grammar specifics for each language in separate namespaces, for example. A cross-platform, extensible vm that can be as fast as hand-coded assembly (sometimes faster than -O3 C/C++) and is also a scripting language - a universal, extensible language with optional compilation to .o files for c linkage or stand alone executables. Recognizing the incredible utility of the list data structure and set theory makes lisp - lambda calculus the interface here to human language and thinking; "forth" is the interface to the hardware, with C in the middle. A sort of high level, rpn, macro assembler, a "Turing tarpit" explorer/miner - (really, exploring new runtime implementation ideas in computer science) ; bringing ideas from forth together with  smalltalk, self, lisp and prolog in c and machine code is a long term goal. 

The Turing Machine, the Lambda Calculus and Type/Category Theory are the theoretical foundations. Levels : 0/1 : Hardware -> MachineCode - Intel - ARM - Turing -> Forth - Concatenative - Joy - Moore - von Thun -> Lisp - Church - McCarthy -> yacc - Generative Grammar - Panini - Chomsky -> Prolog - Curry - Logic -> Smalltalk - Category Theory - Eilenberg - Mac Lane -> Type Theory - Milner - Harper - SML - Ocaml - Haskell -> ? -> Nonduality -> Integral - Wilber -> Infinity - ...

With an ideal that the best (cross platform) virtual machine or common language runtime is a minimal but maximally extensible, optimally and simply compiled (rpn) language, certainly "human readable", but also maintainable, learnable and extensible. But not as fundamentally one huge tool rather several small, relatively easily understood (groups of) tools working "organically" together - maybe most like forth but not limited by its assumptions with an action, constructive, granular, transparent, operational semantics of a higher level functional, denotational semantics. This, we feel, would also have the potential to be a step toward more cooperation in the (linux) programming world. Ometa, Yacc, uml, categorical logic, etc. with whatever syntax on top of this.

A minimal, extensible, cooperative, sustainable,  all level capable language - machine code to scripting, compiler-interpreter, concatenative, anonymous subroutines, nested locals, dynamic variables, optimized machine code compiler, debugger, history, tab completion, ..., "organically" growing, experimental. All the "keywords" and syntactic characters are user changeable. Different problem domains can not only have different classes/objects/morphisms but can have different syntax.

Currently working :  a concatenative, forth like system with an optimizing native code x86, register machine, compiler, single stepping x86 debugger (using udis for instruction disassembly), classes - namespaces, interpreter, history and tab completion, bignums, nested locals, readline, raw readTable, lexical readTable, embedded languages in C (maru, retroForth, joy, chibi-scheme, sl5, femtolisp/tiny, ...) ... 
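
For readers new to postfix evaluation, here is a minimal sketch in C of what "concatenative, RPN" means in practice: tokens are read left to right, numbers are pushed on a data stack, and each operator consumes its arguments from the stack. This is not CfrTil code; `rpn_eval` and its constants are hypothetical names chosen for illustration, and it handles only `+`, `-` and `*` on integers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { RPN_STACK_MAX = 64, RPN_SRC_MAX = 256 };

/* Hypothetical helper: evaluate a space-separated postfix expression
   over long integers. Supports + - * ; every other token is parsed
   as a decimal number. */
long rpn_eval(const char *src)
{
    long stack[RPN_STACK_MAX];
    int sp = 0;
    char buf[RPN_SRC_MAX];

    assert(strlen(src) < sizeof buf);
    strcpy(buf, src);                       /* strtok modifies its input */

    for (char *tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (strlen(tok) == 1 && strchr("+-*", tok[0]) != NULL) {
            assert(sp >= 2);                /* an operator needs two arguments */
            long b = stack[--sp];
            long a = stack[--sp];
            long r = (tok[0] == '+') ? a + b
                   : (tok[0] == '-') ? a - b
                   : a * b;
            stack[sp++] = r;
        } else {
            assert(sp < RPN_STACK_MAX);
            stack[sp++] = strtol(tok, NULL, 10);
        }
    }
    assert(sp == 1);                        /* exactly one result remains */
    return stack[0];
}
```

With this sketch, `rpn_eval("2 3 + 4 *")` computes (2 + 3) * 4 without parentheses or precedence rules; the order of the tokens alone fixes the order of evaluation.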

Current Focus and Direction : 
  # self-hosting : list objects and a lisp like eval, ometa/yacc/bison support : support for a midlevel language modeled after scheme/lisp/racket to support a higher level modeled after sml/ocaml/haskell and prolog/teyjus/hol + ometa/peg/yacc/bison for a sort of applied unified field theory of programming languages, similar in vision to maru by VPRI-Ian Piumarta ; we have a forth like system and we are adding lisp/smalltalk/prolog semantics.
  # OS interface - linux/plan9/etc : file, terminal, gui, process/threads, audio, opengl, browsers, html5, flash, game engine, etc.

Goals : Explore the power of simplicity as a software design principle - how simple can it be and still be totally effective. A correct, reliable, fast, small, and extensible machine coding compiler supporting any and all high level language semantics with an integrated machine language debugger. An optimal language core for fundamental (Turing Machine + Push Down Automata) semantics - the one language that holds us all together is, of course, the Turing Machine. Based on this small layer, as simple as mathematically possible and transparently correct, add a (minimal) set of primitives, clearly defined, easily understandable, extensible, easily learnable, computationally complete and efficient, compiling to native machine code or a reflective vm. So it's languages all the way down and up but integrated. Forth + C + joy + smalltalk, for the core semantics, syntactic compiler-compilers above for anything else, ie. javaScript, io, lisp, prolog, perl, python, ruby, (anything using a virtual machine), etc. Simple syntax - concatenation (left-associative (LA) grammar -- believed by some researchers to be closest to natural language), understandable, not trivial but easily learnable; only a minimal number of fairly simple, useful ideas have to be mastered. Fast, small, simple interpreter, compiler, lowest levels easily verified by inspection, and a maximally minimal set of primitives, or operators capable of interacting with the cpu, operating system, gui, other compiled or scripting languages, shells, etc.; ie. at any level and across levels. Using some useful ideas in formal language theory and (combinatory) logic (blocks/quotations) and LA grammar, I want to develop complete, simple and sufficiently fine grain access to the advantages of machine code for a language designed to transcend and include machine code, a sort of higher level assembly language. Using ideas from joy, forth, factor, smalltalk, lisp mainly, with some keywords from c/c++/c#. 
'{' and '}' used for blocks/quotations because blocks are anonymous subroutines, like blocks in C. Combinators operate on blocks/subroutines and are postfix. This software is especially for those who want to be close to the implementation so it can be creatively adapted.
  Optimizing native code compiler ("assembler"), single stepping x86 debugger (using udis for instruction disassembly), classes - namespaces, interpreter, history and tab completion, bignums, nested locals, readline, raw readTable, lexical readTable... 
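
The idea of blocks as anonymous subroutines with postfix combinators operating on them can be sketched in C as follows. This is an illustration under assumptions, not the CfrTil implementation: the `Cell` union, the stack helpers, and the `times` and `if_else` combinators are hypothetical names. A block is modeled as a function pointer that sits on the data stack like any other value, and a combinator pops the block(s) it needs and runs them.

```c
#include <assert.h>

typedef void (*Block)(void);                 /* a block: an anonymous subroutine */
typedef union { long num; Block blk; } Cell; /* a stack cell holds a number or a block */

enum { STACK_MAX = 64 };
static Cell stack[STACK_MAX];
static int sp = 0;

static void push_num(long n)  { assert(sp < STACK_MAX); stack[sp++].num = n; }
static void push_blk(Block b) { assert(sp < STACK_MAX); stack[sp++].blk = b; }
static long pop_num(void)     { assert(sp > 0); return stack[--sp].num; }
static Block pop_blk(void)    { assert(sp > 0); return stack[--sp].blk; }

/* Two example blocks (hypothetical). */
static void dbl(void) { push_num(pop_num() * 2); }
static void neg(void) { push_num(-pop_num()); }

/* Postfix combinator ( n block -- ): run block n times. */
static void times(void)
{
    Block b = pop_blk();
    long n = pop_num();
    while (n-- > 0) b();
}

/* Postfix combinator ( flag then-block else-block -- ): run one of two blocks. */
static void if_else(void)
{
    Block else_b = pop_blk();
    Block then_b = pop_blk();
    if (pop_num()) then_b(); else else_b();
}

/* Start from 1, then apply the dbl block 5 times. */
long demo_times(void)
{
    sp = 0;
    push_num(1); push_num(5); push_blk(dbl); times();
    return pop_num();
}

/* Start from 7, then double it if flag is nonzero, else negate it. */
long demo_if_else(long flag)
{
    sp = 0;
    push_num(7); push_num(flag); push_blk(dbl); push_blk(neg); if_else();
    return pop_num();
}
```

In both demo functions the arguments and blocks are pushed first and the combinator comes last, which is the postfix order described above.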

Status : the core seems stable now, with some beta and alpha parts - still adding new features. It currently provides fairly easy, low level machine code access with high level language constructs (blocks, with nested locals, conditionals, 'for', 'case/switch', etc.), interpreter with single step native code debugger, powerful readline with history, tab completion, etc. It currently uses only an advanced "instruction pipeline" optimization technique but still achieves 2.3x gcc -O3 speeds worst case. Consider also that the actual code base is only a few hundred K compressed so learning and changing things is relatively easy. The focus has been on basic bottom level infrastructure, performance and transparency, not yet on expressiveness and data. 

The high level syntax work has not been done yet so the syntactic form is still minimal postfix with a few prefix words (mini-languages in themselves) and may not look as elegant to some as C#/Java/C++ or the newer languages like JavaScript, OCaml, Haskell, Go, Vala, etc. But postfix languages need no parentheses and have elegance and power in minimalism. CfrTil has smart operators and other experimental syntactic forms. Even so I am interested in supporting a compiler compiler for any desired language syntax/grammar. An idea I am using is that with optimization off it has a simple, basic forth look (syntax and semantics) but with optimization on it is not only much faster but also has "smart operators" that can allow both a forth "look" and more of a C "look" to the code in some ways. Variables will be interpreted as lvalues or rvalues by the operators and the forth '@' is implicit (eg. ++/--). This is an evolving idea and it could also be more controlled by switching namespaces.
For me this project is research into logic, language, cognitive and computer science; experimental, a sort of virtual research and exercise (fun) space. Recent work has mainly been on the lowest level, the compiler. Probably not for beginning programmers. It has a readline and rudimentary signal processing built in. You can crash it but it instantly restarts. 
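
The "smart operators" idea mentioned above (operators decide whether a variable is used as an lvalue or an rvalue, so the forth '@' becomes implicit) can be illustrated with a small C sketch. This is only one possible reading of the idea, not how CfrTil implements it: the `Operand` type and the `lit`, `var`, `smart_add`, `smart_inc` and `smart_assign` functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OPERAND_LITERAL, OPERAND_VARIABLE } OperandKind;

/* Hypothetical operand: either a literal value or a reference to a variable. */
typedef struct {
    OperandKind kind;
    long literal;   /* used when kind == OPERAND_LITERAL  */
    long *addr;     /* used when kind == OPERAND_VARIABLE */
} Operand;

Operand lit(long n)
{
    Operand o = { OPERAND_LITERAL, n, NULL };
    return o;
}

Operand var(long *p)
{
    assert(p != NULL);
    Operand o = { OPERAND_VARIABLE, 0, p };
    return o;
}

/* The implicit fetch: a variable used as an rvalue is dereferenced here,
   so the caller never writes an explicit '@'. */
static long rvalue(Operand o)
{
    return o.kind == OPERAND_VARIABLE ? *o.addr : o.literal;
}

/* '+' treats both operands as rvalues. */
long smart_add(Operand a, Operand b) { return rvalue(a) + rvalue(b); }

/* '++' treats its operand as an lvalue; returns -1 if it is not a variable. */
int smart_inc(Operand o)
{
    if (o.kind != OPERAND_VARIABLE) return -1;
    ++*o.addr;
    return 0;
}

/* '=' treats dst as an lvalue and src as an rvalue. */
int smart_assign(Operand dst, Operand src)
{
    if (dst.kind != OPERAND_VARIABLE) return -1;
    *dst.addr = rvalue(src);
    return 0;
}
```

In classic forth style the programmer writes the fetch and store explicitly; in this sketch the same variable reference is passed to every operator and each operator chooses the lvalue or rvalue interpretation itself.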

NB : The above stated Goals are not necessarily to be taken as the current Status of the project. For now only the core functionality is (almost) always there but it is feeling stable now. Verify and evaluate its usefulness for yourself moment by moment, it is not guaranteed for anything. Use at your own risk.


Developed with Netbeans & Eclipse CDT, using gdb, on Kubuntu using make, gcc/g++, cproto (http://sourceforge.net/projects/cproto/files/latest/download), linked with udis ( http://prdownloads.sourceforge.net/udis86/udis86-1.7.1.tar.gz ; configure CFLAGS=-m32 --enable-shared) and gmp (ftp://ftp.gmplib.org/pub/gmp-5.1.2/gmp-5.1.2.tar.lz ; configure CFLAGS=-m32 ABI=32 --enable-shared). 

CfrTil (pronounced "C-fertile") - Compiler Foundations Research - Toolkit for Implementing Languages : is a name chosen for a C based first implementation of an openVmTil, strongly influenced by Forth, and simple Reasonableness - logic. The Til originally referred to Threaded Interpretive Languages but I outgrew that name and all forth related threading techniques when I went to optimized native register code (the unoptimized stack based code can be used if you turn off optimization). Note : g++ somehow currently breaks this code with any optimization on (-O1 -O2 -O3); only the plain version (cfrtil no optimization) and the debug/default version, cfrtil-gdb, work correctly. 
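
For context on the threading techniques mentioned above, here is a simplified C sketch of threaded code next to the kind of direct computation a native compiler aims for. It is a generic textbook-style illustration, not code from CfrTil: the stack helpers, `square_thread`, `run_thread`, `threaded_square` and `native_square` are hypothetical names. In the threaded version a definition is a list of pointers to primitives and an inner interpreter walks that list; every step goes through the data stack and an indirect call.

```c
#include <assert.h>
#include <stddef.h>

enum { STACK_MAX = 64 };
static long stack[STACK_MAX];
static int sp = 0;

static void push(long v) { assert(sp < STACK_MAX); stack[sp++] = v; }
static long pop(void)    { assert(sp > 0); return stack[--sp]; }

typedef void (*Prim)(void);     /* a primitive word */

static void dup_top(void) { long a = pop(); push(a); push(a); }
static void mul(void)     { long b = pop(); long a = pop(); push(a * b); }

/* A definition as a NULL-terminated "thread" of primitives: dup then multiply. */
static const Prim square_thread[] = { dup_top, mul, NULL };

/* Inner interpreter: call each primitive in the thread in turn. */
static void run_thread(const Prim *ip)
{
    while (*ip != NULL)
        (*ip++)();
}

/* Threaded execution: stack traffic plus one indirect call per word. */
long threaded_square(long n)
{
    sp = 0;
    push(n);
    run_thread(square_thread);
    return pop();
}

/* What an optimizing native compiler can reduce the same definition to:
   the value stays in a register and no interpreter loop runs. */
long native_square(long n)
{
    return n * n;
}
```

Both functions compute the same result; the difference is the per-word overhead of the inner interpreter and the stack, which is what moving to optimized native register code removes.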

Acknowledgements : Special thanks for : Haskell B. Curry (Foundations of Mathematical Logic), Alan Turing, Alonzo Church, John McCarthy, John Backus, Noam Chomsky, Donald Knuth, Martin Gardner, Panini and Jiva Goswami (Sanskrit grammarians), Starting Forth, Threaded Interpretive Languages, Compiler Design in C, Category Theory, Combinatory Logic, Lambda Calculus, Turing Machines ; "Strange attractors" : sml (ocaml, f#, scala, haskell), teyjus (prolog, hol, hol light, coq, agda), lisp, (scheme, clojure), factor (retro, reva, jonesForth), self (smalltalk), llvm (clr, mono, jvm, plan9), mathematica, joy, javaScript, j, perl, bash, ... ; also the developers of linux, xorg, forth, gcc/g++, ubuntu, kde, netbeans, udis, gmp, cproto, and so many others on whose (code) shoulders we stand.

I think of language design and coding as an Art form as well as a Science.
I feel all computer languages are derived from (applied) logic ~ currently best represented by the Turing Machine, Category Theory, the Lambda Calculus, and Type Theory - (our one universal language?).