Introduction and Background

Since December of 2021, I have been working hard on an exciting programming language translation project to meet my computer science degree requirements. All computer science students at Walla Walla University must complete a senior capstone project for a client of their choosing. The project has to be completed throughout the school year and consist of 120 hours put towards development and reporting. I chose to work on a project proposed and sponsored by Professor James Foster in the Walla Walla University Computer Science Department called Grail, a Python to GemStone Smalltalk translator written in GemStone Smalltalk.

GemStone is an object database system developed and sold by GemTalk Systems. It is quite different from common database systems such as MariaDB, PostgreSQL, and Azure Cosmos DB. It doesn’t store rows in tables or documents in collections; it stores raw objects. This makes the interface very simple: just write Smalltalk code and create objects. When you’re finished, commit your code, and the objects you created will be stored. However, storing objects is not the only thing GemStone is good at. It has a large feature set that makes it attractive to companies and Smalltalk developers with a large amount of data to store.

GemStone does still have limitations and one glaring limitation is that it is tightly coupled with Smalltalk. Currently, the only reliable way of interfacing with GemStone requires applications to be written in GemStone Smalltalk, a Smalltalk dialect. In order to make GemStone more attractive to a wider audience of developers, this project sets out to eventually translate Python to Smalltalk “on the fly”, making Python a viable language to use with GemStone.

Grail takes Python source files as input and outputs Smalltalk source code in the GemStone Smalltalk dialect. The Smalltalk source code can then be imported and run on a GemStone server allowing Python developers to use GemStone as a database, without switching their development language over to Smalltalk. The Python developer can use all of the great things Python offers while still harnessing the power of an object database.

Topaz, the command line interface for managing and developing on a GemStone server, allows us to extend the project even further by writing scripts that install the generated Smalltalk source files and run them, all with one script in a language of our choosing (I chose batch on Windows). Taking this further, we could make this script executable in VS Code for the currently open Python file. This means Python developers would not even have to leave VS Code to execute their Python code in their GemStone environment.

Methodology

The initial step when translating a language is to represent the source language in a way that is parsable. Programming languages are defined using an abstract grammar, which provides a very rigid definition of every symbol, expression, and statement in the language. The abstract grammar is then used to generate an Abstract Syntax Tree (AST) for the entire program. Fortunately, Python provides us with a package (ast) that can export entire files (Python modules) as an AST that can then be “pretty printed” as a formatted string. This formatted string is then used as the input to a parser. The parser takes the string and creates AST node objects in the target language. These AST nodes in turn know how to output themselves as target source strings, which are pieced together via consecutive calls to all of the AST nodes in the tree to create a target source file.

As a side note, the AST node objects could technically be created in a language other than the target language. In fact, though it would make things a lot more difficult, the entire translator could be written in a third mediator language. It would be more difficult because the mediator language understands neither the source language nor the target language, so you would have to model the behavior of two different languages in order to translate between them. It is best to write the translator in whichever of the two languages, source or target, has the higher perceived complexity, and use it to model the behavior of the other. It really can go either way, but for this project it was decided to model Python behavior in Smalltalk. Looking back, though, I think Python has the higher complexity.

It turns out that creating the Smalltalk source strings from Python AST nodes is a complex process that requires a lot of overhead Smalltalk code to model the behavior of Python. In Python, the root of the execution hierarchy is called the builtins package. This package can be imported and analyzed, but it mostly remains hidden, as it’s where the language behavior is defined. In order to accurately model Python behavior in Smalltalk, builtins has to be completely rewritten as a Smalltalk Dictionary (analogous to a Python package). Once builtins is accurately rewritten, we can tell the AST nodes to translate themselves as strings that are made up of calls to builtin methods and instantiations of builtin objects such as int and str.

After the Python code has been translated, we then have to output the Smalltalk source created from the Python AST to some sort of executable medium. While file output was beyond the scope of my project, we did get to a point where some fairly complex Python code was being translated into raw Smalltalk source strings. This is still very powerful, since in Smalltalk we can evaluate these strings as Smalltalk source code and test our modeled behavior.

Parser

While the first step of the translation process is really generating an AST from some Python source file, that part is trivial. Python provides a package that allows us to generate and pretty print an AST that we can use as the input to our parser without any changes. Below is an example output from the ast package for a simple “Hello world!” program.

# print(ast.dump(ast.parse("print('Hello world!')"), indent=4))
Module(
    body=[
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Constant(value='Hello world!')],
                keywords=[]))],
    type_ignores=[])

With the AST in pretty-printed string format, we can use the string to create a new Smalltalk ReadStream. ReadStreams are string buffers that allow us to consume a string one character or multiple characters at a time while preserving the current position. The GemStone dialect provides us with methods for getting the next characters, setting the position, skipping characters, peeking at characters, and so on. Streams in general, both WriteStreams and ReadStreams, are also very efficient, much more efficient than raw string operations.
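For readers more at home in Python than Smalltalk, the idea of a position-preserving read buffer can be sketched as follows. This is only an illustration: the class and its method names are hypothetical Python stand-ins chosen for this example, not GemStone's ReadStream API.

```python
class ReadStream:
    """A tiny cursor over a string, loosely modeled on the idea of a ReadStream."""

    def __init__(self, text):
        self.text = text
        self.position = 0

    def at_end(self):
        return self.position >= len(self.text)

    def peek(self):
        # Look at the next character without consuming it.
        return None if self.at_end() else self.text[self.position]

    def next(self, count=1):
        # Consume and return up to `count` characters.
        chunk = self.text[self.position:self.position + count]
        self.position += len(chunk)
        return chunk

    def skip(self, count=1):
        # Move the cursor forward without returning anything; clamp at the end.
        self.position = min(len(self.text), self.position + count)

    def up_to(self, delimiter):
        # Consume everything before `delimiter`; the delimiter itself is
        # consumed but not returned.
        start = self.position
        end = self.text.find(delimiter, start)
        if end == -1:
            self.position = len(self.text)
            return self.text[start:]
        self.position = end + len(delimiter)
        return self.text[start:end]


stream = ReadStream("Module(body=[])")
name = stream.up_to("(")     # the node name
field = stream.up_to("=")    # the first field name
following = stream.peek()    # the next character, still unconsumed
```

Because the stream remembers its own position, it can be handed from one piece of parsing code to the next without any of them tracking offsets by hand.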

To parse, the AST in the ReadStream is recursively analyzed for familiar tokens that can be matched to AST nodes. When the match is found, the ReadStream is then passed down to the node which knows how to initialize itself with the values provided in the AST. Since the program is represented as a syntax tree, we have to preserve hierarchy. However, the recursive nature of this algorithm makes this simple. We simply add the current node as the parent to the next node that is being parsed during the initialization step.
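The recursive approach can be sketched in Python as well. The Node and DumpParser classes below are hypothetical and written only for this illustration (Grail's parser is written in Smalltalk and is not reproduced here); the sketch handles just the subset of the dump format needed for simple programs.

```python
import ast


class Node:
    """Hypothetical stand-in for one parsed AST node (not Grail's real class)."""

    def __init__(self, kind, parent=None):
        self.kind = kind        # e.g. "Module", "Call", "Name"
        self.parent = parent    # link back up the tree
        self.fields = {}        # field name -> Node, list, or plain value


class DumpParser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        # Skip whitespace, then look at the next character without consuming it.
        while self.text[self.pos].isspace():
            self.pos += 1
        return self.text[self.pos]

    def parse(self):
        return self.parse_value(parent=None)

    def parse_value(self, parent):
        ch = self.peek()
        if ch == "[":
            return self.parse_list(parent)
        if ch in "'\"":
            return self.parse_string()
        start = self.pos
        while self.pos < len(self.text) and (
            self.text[self.pos].isalnum() or self.text[self.pos] in "_.-"
        ):
            self.pos += 1
        word = self.text[start:self.pos]
        if self.pos < len(self.text) and self.text[self.pos] == "(":
            return self.parse_node(word, parent)
        return ast.literal_eval(word)    # numbers, None, True, False

    def parse_node(self, kind, parent):
        node = Node(kind, parent)        # the parent is recorded during initialization
        self.pos += 1                    # consume "("
        while self.peek() != ")":
            end = self.text.index("=", self.pos)
            name = self.text[self.pos:end]
            self.pos = end + 1
            node.fields[name] = self.parse_value(parent=node)
            if self.peek() == ",":
                self.pos += 1
        self.pos += 1                    # consume ")"
        return node

    def parse_list(self, parent):
        items = []
        self.pos += 1                    # consume "["
        while self.peek() != "]":
            items.append(self.parse_value(parent))
            if self.peek() == ",":
                self.pos += 1
        self.pos += 1                    # consume "]"
        return items

    def parse_string(self):
        quote = self.text[self.pos]
        end = self.pos + 1
        while self.text[end] != quote:
            end += 2 if self.text[end] == "\\" else 1   # step over escapes
        literal = self.text[self.pos:end + 1]
        self.pos = end + 1
        return ast.literal_eval(literal)


dump = ast.dump(ast.parse("print('Hello world!')"), indent=4)
tree = DumpParser(dump).parse()
call = tree.fields["body"][0].fields["value"]
```

Each nested call receives the node that is currently being built, so the hierarchy falls out of the recursion itself: here, call.parent is the Expr node and call.parent.parent is the Module.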

With an AST that has been parsed and converted into Smalltalk objects, we can move on with the translation.

Translator

We accomplish the translation step by performing a second pass over the entire AST, but this time represented as Smalltalk objects rather than a raw string. Each of the AST node objects has a printSmalltalkOn: keyword method that takes a WriteStream as its argument. Each of these methods is specialized for its individual node so that the node can output itself as a Smalltalk string. For example, the integer 5 would output itself as 'int ___value: 5', where int is a Smalltalk class modeling a Python integer as defined in the Smalltalk implementation of the Python builtins package mentioned earlier (more on that below) and ___value: 5 is a class-side method on int that creates a new instance of int and initializes the value instance variable to the Smalltalk integer 5. Also notice the quotes indicating that this is a Smalltalk String.

With the above in mind, we can then begin at the top of the tree (ModuleAst) and recursively work our way down, just like the parser, only this time we are calling each node’s printSmalltalkOn: method and we are passing around a WriteStream rather than a ReadStream. WriteStreams are just like ReadStreams only instead of reading a buffer, we are building a buffer.
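To make the second pass concrete, here is a minimal Python sketch of nodes that write themselves onto a shared stream. The node classes and the print_smalltalk_on method name are hypothetical stand-ins invented for this example, io.StringIO plays the role of the WriteStream, and the exact text Grail emits may differ in spacing and parenthesization.

```python
import io


class IntNode:
    def __init__(self, value):
        self.value = value

    def print_smalltalk_on(self, stream):
        stream.write(f"(int ___value: {self.value})")


class StrNode:
    def __init__(self, value):
        self.value = value

    def print_smalltalk_on(self, stream):
        # Smalltalk escapes a single quote inside a string by doubling it.
        escaped = self.value.replace("'", "''")
        stream.write(f"(str ___value: '{escaped}')")


class CallNode:
    def __init__(self, name, args):
        self.name = name
        self.args = args

    def print_smalltalk_on(self, stream):
        stream.write(f"(currentScope at: #{self.name}) scope: currentScope positional: {{ ")
        for arg in self.args:
            arg.print_smalltalk_on(stream)   # recurse into the child nodes
            stream.write(". ")
        stream.write("} named: Dictionary new")


class ModuleNode:
    def __init__(self, body):
        self.body = body

    def print_smalltalk_on(self, stream):
        stream.write("| currentScope |\ncurrentScope := Variables new.\n")
        for statement in self.body:
            statement.print_smalltalk_on(stream)
            stream.write(".\n")


module = ModuleNode([CallNode("print", [StrNode("Hello world!")])])
stream = io.StringIO()
module.print_smalltalk_on(stream)
source = stream.getvalue()
print(source)
```

Only the root node is asked to print itself; every other node is reached through its parent, and all of them append to the same buffer.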

Below is the same “Hello world!” program as in the previous section, only now it has been translated to Smalltalk source code.

| currentScope |
currentScope := Variables new.
(currentScope at: #print) 
    scope: currentScope 
    positional: { (str ___value: 'Hello world!'). } 
    named: Dictionary new

The above example looks complicated due to the overhead code that needs to be generated in order to handle scoping. I cover scoping in more detail in the Challenges section, but a brief explanation of what is shown should be covered here.

First, we define a new variable called currentScope and initialize it with a new instance of Variables. Variables is a helper class that inherits from SymbolDictionary and overrides some of its methods in order to change the behavior to support hierarchy. currentScope is then passed to the print method that is already defined in the currentScope where a new child scope is created to store any variables that are created within the scope of the print method call.

When calling a method that is defined in currentScope, we are actually calling a special method defined on a helper class called FunctionDef of which print is an instance. The method is FunctionDef >> scope:positional:named: where scope is the parent from which a child scope should be created before execution of the method, positional are the positional arguments as an array, and named is a Dictionary of named arguments.

The child scope is more easily explained with an example that contains a function definition. The following Python code:

def sayHello(to, excited):
    trailingCharacter = '.'
    if excited:
        trailingCharacter = '!'
    print('Hello ' + to + trailingCharacter)

to = 'Will'
sayHello(to, True)

becomes:

| currentScope |
currentScope := Variables new.
currentScope
  at: #say_hello 
  put: (FunctionDef new 
          args: { #to. #excited }; 
          kwonlyargs: { }; 
          vararg: #None; 
          kwarg: #None; 
          kw_defaults: { }; 
          defaults: { }; 
          block: [ :currentScope |
                    currentScope at: #trailing_character put: (str ___value: '.').
                    (currentScope at: #excited) ___value ifTrue: [
                      currentScope at: #trailing_character put: (str ___value: '!')
                    ].
                    (currentScope at: #print) 
                      scope: currentScope 
                      positional: { (((str ___value: 'Hello ')
                          __add__: (currentScope at: #to)) 
                          __add__: (currentScope at: #trailing_character)). } 
                      named: Dictionary new.
            ]; yourself).
currentScope at: #to put: (str ___value: 'Will').
(currentScope at: #say_hello) 
      scope: currentScope 
      positional: { (currentScope at: #to). True. } 
      named: Dictionary new.

Note that the above generated Smalltalk has been cleaned up a bit. There are still some issues with formatting during the translation step.

As you can see, the function sayHello becomes an instance of the FunctionDef class. While the creation of the child scope is hidden within the implementation of FunctionDef >> scope:positional:named:, you can see that the currentScope within the block that is executed by that method is not the same currentScope as the root of the program.
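The calling convention can be modeled in a few lines of Python. This is a sketch under assumptions, not the Smalltalk implementation: the Python FunctionDef class and its call method are hypothetical, collections.ChainMap stands in for the scope object, and print is simplified to a plain callable rather than another FunctionDef.

```python
from collections import ChainMap


class FunctionDef:
    """Hypothetical Python model of the FunctionDef helper described above."""

    def __init__(self, args, block):
        self.args = args      # parameter names, in order
        self.block = block    # callable that receives the child scope

    def call(self, scope, positional, named):
        # new_child() puts an empty dict in front of the caller's scope:
        # lookups fall through to the parent, new bindings stay in the child.
        child = scope.new_child()
        child.update(zip(self.args, positional))
        child.update(named)
        return self.block(child)


printed = []                                   # captures output for inspection
root = ChainMap({"print": printed.append})     # top-level scope with one built-in


def say_hello_block(current_scope):
    current_scope["trailing_character"] = "."
    if current_scope["excited"]:
        current_scope["trailing_character"] = "!"
    current_scope["print"](
        "Hello " + current_scope["to"] + current_scope["trailing_character"]
    )


root["say_hello"] = FunctionDef(["to", "excited"], say_hello_block)
root["to"] = "Will"
root["say_hello"].call(root, [root["to"], True], {})
```

After the call, trailing_character and excited exist only in the discarded child scope; the root scope still holds just print, say_hello, and to.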

Lastly, I very quickly skipped over the explanation of the print built-in method that is already stored within currentScope. Whenever a new Variables instance is created outside the context of a parent (i.e. outside of creating a child scope), we perform some initialization steps to insert built-in methods into the scope. Currently, print is the only built-in method that has been implemented in the builtins mock library so it is the only method that is added to the top-level scope during initialization.

Builtins

In the above section, I showed some sample code that instantiates some mysterious objects such as int and str. These are Smalltalk objects that mock Python objects. Since the built-in classes in Smalltalk behave differently from the built-in classes and methods in Python, we need a way to mock the behavior of Python in Smalltalk. We do this by completely rewriting Python’s built-in types in Smalltalk.

The base of Python’s execution hierarchy is builtins, which is just a package that you can import like any other package in order to inspect all of its types and methods. This package is where all of Python’s built-in behavior is defined. In Smalltalk, we created a new Dictionary called builtins and defined every Python type and all of each type’s methods, mocking the behavior of each type.

We also need a way of creating these types and interfacing with them as Smalltalk objects, so we created Smalltalk-specific methods that we prefixed with a triple underscore. In the above section, I gave an example of int ___value: 5. This statement essentially says: call the Smalltalk interfacing class-side method ___value: on the Python class int with a Smalltalk argument of 5.

The string that is output from the translator can then be executed using the evaluate method on Smalltalk’s String class. When the string is executed, it will start creating built-in objects, calling built-in methods, creating scopes, and so on: basically everything involved in executing a sophisticated Python script.

Output

When we started the project, we asked the question, “in what format would the translated source code be most helpful?” We thought about files and immediate execution. Files would provide more useful diagnostic information if something about the translated program isn’t working, but immediate execution is very convenient. Though, we could also output files and immediately execute the file so there is both the raw source code to analyze and the convenience of the immediate execution.

However, since a well-defined deliverable output format turned out to be too far out of scope for the time I had, we landed on immediate execution of source code strings for demonstration purposes. I wrote a fairly simple batch script to log in to Topaz and run the file currently open in VS Code on my Windows machine. Since this script is for demonstration purposes only, it is fairly machine-specific, so it may or may not work in other environments depending on how they’ve been set up. Below is a demonstration video of the script in action to translate the “Hello world!” file.

Challenges

The project began by having to port the mock builtins package from Pharo Smalltalk to GemStone Smalltalk. The two dialects have some differences, so the majority of the existing test suite was failing. The package is also still incomplete, so programs that are translated using Grail have some limitations directly corresponding to the types that haven’t been written or completed. I had to begin by taking inventory of the tests that were failing and then take one test at a time, fixing the method being tested and/or the test itself. I also had to expand and change some tests to either be more consistent with Python’s behavior or to test more thoroughly.

The next suite that was failing was for the parser. When working with this suite, most of the tests themselves could be left alone, but there were issues with some of the initialization code that were the result of changes made before I began on the project. Then came the translator, which was a completely new part of the project that needed a completely new suite in order to test it. I built this suite as I went, writing tests for each node’s translation method as it was implemented.

Because of the issues that arose when porting the builtins library and problems with the parser, my original schedule was not very closely followed. The project turned into getting as much done on each of the pieces in the time I was given. This actually worked pretty well, but many of the things on my original schedule were not even touched. It is also very difficult to accurately describe a project’s requirements and give an accurate time estimation before starting. Some of the items we put in my schedule were either in the wrong place, didn’t need to be there at all, or way out of scope.

Another challenge I faced was complex portions of Python’s behavior that made translation non-trivial. One really big issue was the interaction between scope, variable definitions, and function definitions. All three are tightly coupled because variables and functions are defined within a scope. In Python, there is the global scope that lives at the module level and then there is a hierarchy for each level of function definitions. For example, when a function is defined at the module level, a new scope is created as a child of the module scope. When a variable or function is defined within this child scope, it is local to that child scope. To complicate things further, Python has global and nonlocal keywords that can modify this scoping behavior. For instance, declaring a variable x as global in the child scope makes assignments to x in that scope bind x at the module level.
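The scoping rules a translator has to reproduce can be seen in a short, plain Python example. It uses the two statements Python actually provides for changing where a name is bound, global and nonlocal; the function and variable names are made up for illustration.

```python
counter = 0              # bound in the module (global) scope


def bump_local():
    counter = 100        # creates a new local name; the module-level counter is untouched
    return counter


def bump_global():
    global counter       # assignments to counter now target the module scope
    counter += 1
    return counter


def make_accumulator():
    total = 0

    def add(amount):
        nonlocal total   # rebinds total in the enclosing function's scope
        total += amount
        return total

    return add


local_result = bump_local()
after_local = counter            # still the module-level value
global_result = bump_global()
add = make_accumulator()
add(2)
nonlocal_result = add(3)
```

Without the global statement, the assignment in bump_local simply shadows the module-level name, which is the default behavior a tree of scopes has to model first.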

While the global and nonlocal keywords ended up out of scope for my project, I did get a fairly well-working implementation using a custom helper class called Variables that subclasses SymbolDictionary. Representing Python’s scope using key-value pairs in what is essentially a tree of dictionaries turned out to be a very elegant solution.
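The tree-of-dictionaries idea translates naturally back into Python. The class below is a hypothetical analogue written for illustration only; the real Variables class is a Smalltalk subclass of SymbolDictionary, and its methods are not reproduced here.

```python
class Variables(dict):
    """Hypothetical Python analogue of the Variables helper: a dict with a parent."""

    def __init__(self, parent=None):
        super().__init__()
        self.parent = parent

    def __missing__(self, key):
        # dict calls __missing__ when scope[key] is not found locally, so
        # failed lookups walk up the parent chain. Note that `in` and get()
        # only consult the local dictionary.
        if self.parent is None:
            raise KeyError(key)
        return self.parent[key]

    def new_child(self):
        return Variables(parent=self)


module_scope = Variables()
module_scope["to"] = "Will"

function_scope = module_scope.new_child()
function_scope["trailing_character"] = "!"

inherited = function_scope["to"]     # found by walking up to the parent
function_scope["to"] = "Ann"         # shadows the name in the child only
```

Reads fall through to ancestors while writes always land in the current dictionary, which matches Python's default rule that assignment creates a local binding.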

Development Process

The recurring process in this project is most closely related to Test Driven Development (TDD). Since I was working by myself with design input from Professor Foster, I mostly just needed a process that would help me keep myself accountable for the code I wrote. Along with this, one of the requirements of the project was to continue to grow and maintain the existing test suites. As I worked on the failing tests in the builtins package, I tried to touch every failing test, fixing issues with the test itself, fixing issues with the behavior it was testing, and changing the test if it wasn’t testing the proper behavior or was missing an edge case that I caught. The same goes for the parser package. When it came to the translator, though it was a completely new module being added to the project, the tests were an addition that grew the parser test suite.

In addition to careful testing, I also needed a good feedback loop to ensure I was on the right track. While working on the project, I would regularly (at least once a week) meet with Professor Foster to discuss any problems that I found, what new features or improvements should be added next, and to review what I had completed in the time since our last meeting. Since he was sponsoring the project, it was necessary for me to ensure that his requirements were met and that the project was consistent with his specification.

Version control was also used heavily during development. While we may have benefitted from discussing alternative methods of using the Git repository, all of my additions were added to a single branch to be reviewed and merged after I finished my project. If this were a multi-developer project, this method of using Git would not have worked well, but since I was a solo developer, there was not much concern about conflicts and my job was to be as descriptive as I could be with my commit messages.

Conclusion

GemStone is a very interesting technology that I think is quite innovative and unique. Being able to store objects in a database directly without having to write the overhead code for a database module in each application that needs it is very convenient. It also doesn’t really come with extra technical cost since most SQL databases need to be hosted on a server anyway.

However, Smalltalk is a barrier that most developers are not willing to attempt crossing. While modern object-oriented languages took bits and pieces of the ideas that drove Smalltalk’s development, the language itself never really became popular, and asking developers to switch their primary language to Smalltalk in order to use GemStone is a big ask.

By decoupling GemStone from Smalltalk (or at least hiding the underlying reliance on Smalltalk), a new market opens up to developers who like the idea of an object database, but want to use their language of choice rather than Smalltalk. Translating Python is a great step towards this goal and, since Python is one of the most popular languages there is, captures a lot of those developers.

This project was a great opportunity for me to learn a lot about the internals of Python and GemStone. It also gave me a chance to learn more about parsing and translating languages, reinforcing the knowledge I gained in my courses that covered these topics. As someone who has had a chance to use both GemStone and Python pretty extensively, this project is also very exciting since it could become a very useful tool for me as a developer when it’s finished. This project was a great success from my point of view and I’m excited to watch its continued progress.

Resources