SEJ Developer’s Journal – Old Entries

These are the oldest entries of my journal regarding the development of AFC. The newer entries are on the main journal page. Note that the former name of AFC was SEJ (Spreadsheet Engine for Java).

May 26, 2006

I am working on an introductory example for SEJ for the tutorial (because I need this to streamline the docs and tests for binding and support for multiple data types in the interface). This prompted me to resume work on the simple cell binder. What this thing really needs to do is the following:

Scan the output interface. For every method returning a type SEJ can handle, try to find a correspondingly named cell (GetXY or just XY). If found, bind it. If not, check if the method is abstract. If so, raise an error.

Scan the remaining unbound cell names. For every name, check the input interface for a corresponding method. If found, bind the cell to it. If not, raise an error. This can be restricted to scan only a subset of names following a given name pattern, for example, all starting with I_.

For both inputs and outputs, I could extend this so parametrized methods can be bound. For example: I_TURNOVER_2 would be bound to getTurnover( int yearsBack ).

May 24, 2006

Automatic Conversion of Numeric Types

I discussed with Marcel and Igor that SEJ should allow different numeric input and output types and automatically convert them to the internal numeric type used in computations. This would make input and output interfaces reusable for computations with different internal numeric types, as well as being more convenient. For example:

public interface Input {
  double getRate();
  int getNumberOfItems();
  long getPrice();
}

should be usable with both double, BigDecimal and scaled long engines. This is not so easy:

You must be aware that for scaled long and BigDecimal, the double returned by getRate() might be truncated.

The long returned by getPrice() is hard to interpret. In a scaled long engine, it should be treated as an already scaled long. Otherwise, there is no way an application can pass properly scaled longs into such an engine. For other engines, however, the long would more naturally be interpreted as a true integer type, especially since there is currently no way to tell them the implicit scaling to assume.

With Java 5, we might try to circumvent the latter problem using annotations. But a goal of SEJ is JRE 1.4 compatibility. With annotations, we might have something like:

public interface Input {
  double getRate();
  int getNumberOfItems();
  @ScaledLong(4) long getPrice();
}

For the moment, I shall therefore support only the following automatic input conversions:

byte, treated as a true integer value
int, treated as a true integer value
double, truncated to fit the internal type
BigDecimal, truncated to fit the internal type

Overflows are raised as errors.

On the output side, I shall support:

byte, truncates the internal value
int, truncates the internal value
double, possibly loses precision vis-a-vis the internal value
BigDecimal, always exact

Again, overflows are raised as errors.

May 23, 2006

Had a meeting with Marcel, Igor, Markus and Dani of Abacus today. Lots of new todos. Nevertheless, SEJ is now ready for integration into Abacus Lohn. A conversation with Markus revealed that the currently supported range of Excel functionality will go a very long way.

The .ser file format should really be a .jar. Add an option for whether you want it compressed or not. It contains the generated .class files, and a .xml file representing the internal format of SEJ engines. This makes analyzing .ser files possible using standard tools.

Checked out goal seeking in Excel (as prompted to by Markus). This is purely a menu option, so adding this to SEJ is not really realistic. Maybe as a layer around SEJ. But then, applications can currently do that themselves.

May 18, 2006

Test First

I have implemented caching of values. This has been an elevating experience of test-first and documentation-first development.

I first wrote some notes here in the developer’s journal.

Then I started the new topic in the tutorial and began documenting the feature, drawing on the notes I had first developed here.

Interleaved with the documentation I also wrote the new test cases which I immediately cited again into the documentation.

This took about 3 hours. In the end, I had finished documentation and use-case tests before having written a single line of production code.

I then implemented the feature in about 1:45 h. When the use-case tests ran green, I extended my large spreadsheet-driven test suite to test both caching and non-caching versions. It ran green immediately.

So cool!

Constant Sharing

I started looking at sharing of constant values for expensive types (BigDecimal, for instance). First, I checked that ASM properly combines equal values in the constant pool – it does.

Then I checked the speed of BigDecimal construction using three different approaches:

new BigDecimal( String )
BigDecimal.valueOf( long, scale )
preconstructed in a private static final BigDecimal

The preconstructed case was, of course, fastest. The interesting part is that – for constants whose digits fit into a long, which should be most of them – the difference to BigDecimal.valueOf is not that big. For smaller test runs, it amounts to about 16%. For larger runs it gets closer to 30%. Construction using strings (which SEJ usess right now) is by far the slowest, being about 4 times slower than valueOf( long ).

Aside: Java 6 is already measurably faster than Java 5 on this test.

I have thus changed SEJ so it generates preallocated BigDecimal constants using valueOf( long ) wherever possible. On the JRE 1.4 I had, unfortunately, had to resort to new BigDecimal( String ) for values that already were bigdecimals, because the JRE 1.4 does not support BigDecimal.precision().

May 17, 2006

Correction: BigDecimal was ok in 0.4.0, after all. The compiler inserted the rescaling directly into the code instead of handing it off to the runtime.

Scaled long is up and running now, passing all tests. Instead of making the runtime an instance, I had to add support for adding a last argument with runtime context information to the static runtime methods. This is because otherwise I would have had to know that I would need the runtime instance later on before compiling an expression’s arguments (because the JVM expects the instance for a virtual call on the stack first, not last). That would have greatly complicated the compiler’s design. The context argument I can conveniently push last, where required.

Release 0.4.1 is out.

May 16, 2006

Scaled long support is coming along nicely now. The one big change that’s still missing is to provide the engines with an instance of the runtime. Before, it was sufficient to give them static access. The reason is the runtime must know about things like the scale and rounding mode and it is far simpler to embed this in the runtime once and for all instead of passing it around all the time.

This made me realize that the current BigDecimal operations defined in the runtime do not enforce the fixed scale. I shall fix this for 0.4.1.

May 11, 2006

Since I continually broke the runtime jar, I have now included dedicated runtime tests in the automated build. This also tests engine serialization and deserialization. Yes!

May 9, 2006

Scaled BigDecimal support is finally up and running with all tests green, both on JRE 1.4 and 1.5. Quite a refactoring session that was. I shall soon release it as version 0.4.0.

I had to extend Retrotranslator slightly to support some JRE 1.5 additions to BigDecimal. Retrotranslator’s great design made this very simple.

April 29, 2006

I am working on the BigDecimal constant folder now. It irks me that I have to duplicate a lot of code in the interpreter and the byte code compiler. So I did some tests. Up to Java 5, code such as

double result = getA1() + getA2() * getA3();

is more than twice as fast as

double result = Runtime.opPlus( getA1(), Runtime.opTimes( getA2(), getA3() ) );

Starting with the Java 6 beta, however, both perform identically. This indicates (as another test already did), that the JVM from Java 6 inlines much more aggressively.

This indicates that in the longer run, I can get away with a scheme where the byte code compiler uses a fairly straightforward scheme of compiling expressions to static support methods in a runtime class, which are also used by the interpreter, in an equally generic fashion. What this adds, of course, is a much increased dependency of the generated engines on the runtime class.

April 13, 2006

Just realized that if alternative numeric types are to be supported properly, all the constant folding performed by SEJ will have to be carried out using the alternative types, too. So, for the moment, I shall simply disable constant folding for BigDecimal.

April 12, 2006

Type Inference

I’ve started working on the type annotation algorithm. A key problem seems to be when there is a difference between the expected type (what the outside computation wants) and the inner type (what the value is). Consider, for instance:

BigDecimal getResult1() { return getA().multiply( BigDecimal.TEN ); }
int getResult2() { return getA() + 4; }

What should getA() return? The obvious answer is, of course, the most precise of all the expected types. So it’s BigDecimal here.

Speed Test

Consider, however, the following timings for the repeated execution of the formula x = p + p * f where p = 123.45 and f = 0.076 (yes, I know this can be made more efficient). I coded this formula using double, BigDecimal and both an int and a long scaled by 10’000 (the latter is thus equivalent to the Currency type found in COM and Delphi).

double: 150 ms
int: 200 ms
long: 550 ms
BigDecimal: 5500 ms

Times vary, but the relations are fairly stable. So double is the fastest on my machine (Intel Centrino Core Duo), closely followed by the scaled int. Both are probably not suitable for financial computations, however. Of the remaining two, the scaled long still beats BigDecimal by a factor of 8 to 10. And it might be enough for many financial applications. Particularly so if you can control the scaling factor.

So choosing a faster type can make a huge difference. Can SEJ do this?

Multiple Versions

If getA() is an input cell, we might simply generate n instances of getA(), one for each desired type, with appropriate conversions:

BigDecimal getA_Big() { return BigDecimal.valueOf( this.inputs.getA() ); }
int getA_Int() { return this.inputs.getA(); }

where

interface Inputs {
  int getA();
}

Let’s now consider that getA() is not an input. It then is an intermediate result (that is, a non-input cell that is referenced by multiple other cells). Do we generate multiple instances of the subexpression, one for each desired type? This would be:

BigDecimal getA_Big() { return getB()_Big.add( BigDecimal.valueOf( 20 )); }
int getA_Int() { return getB()_Int + 20; }

assuming getB() is an input like getA() was above. This seems worthwile because we can now compute int getResult2() at full int speed.

Caching

What if, however, getA() were an expensive subexpression? Like a sum over a large dynamic section? Would it not be better to compute it once and cache the result? Like:

BigDecimal computeA() { return /*compute the sum*/; }
BigDecimal getA_Big() {
  if (!isCached_A) {
    cache_A = computeA();
    isCached_A = true;
  }
  return cache_A;
}
int getA_Int() { return getA_Big().intValue(); }

Overflows

What if all the summed cells where int themselves? Should we not sum them as ints then? What if the sum overflows?

Decision

SEJ must make decisions. The question is: Can you affect them and, if so, how?

In view of the overflow problem, I have decided that SEJ will not try to be clever about inferring fast types. Every simple addition of input values already forces escalation to a bigger and slower type, so without hints from outside, SEJ would have to infer slow types for nearly everything very quickly. Who could give the hints? The programmers cannot, because they do not know the computations performed by the sheet. So it would have to be the sheet designers. I cannot imagine them caring about and being able to specify overflow conditions.

What the programmers can tell SEJ is the general class of computation they are dealing with. So I will let them specify the type being used for all numeric computations by a particular engine. The choices will probably be:

double
long, with fixed, definable scale
int, with fixed, definable scale
BigDecimal, with optional minimum scale

The responsibility for this choice, and for communicating its consequences to the sheet designers, thus rests fully with the programmers. But it does allow them to generate engines suited for precise financial or very fast pure integer computations.

April 11, 2006

Release 0.3.2 is out the door. Now I can turn to supporting BigDecimal.

I realized that SEJ should ensure that all abstract methods on the output type are bound when generating an engine.

April 7, 2006

Release 0.3.2 is nearing completion. I just got all the tests running again and can do a complete build, including a version of SEJ for Java 1.4 which also passes all tests. Yay! Here’s the news:

sej-runtime.jar actually works now
More robust formula parsing
Better formula error reporting
IF fully supported, including AND and OR
Java 1.4 compatibility
Dropped Engine.Computation

Before I release, I shall have to test it manually again, and write a dedicated page about current limitations. Missing right now are, in particular:

NOT – this will be easy, I think: just call the inverse test compiler
Range intersection
Vectors
Excel XML
OOCalc
Sections

By the way, the IF logic described below was not fully correct. But I was on the right track. It works like a charm now.

April 6, 2006

I’ll have to try the following:

public abstract class Output {
  public abstract Output newComputation( Input input );
  // ...
}

and then

Output engine = (Output) compiler.compileNewEngine();
Output output = engine.newComputation( input );

But it is hacky! I should introduce a factory object here. Making this factory customizable would allow users to write APIs that completely decouple them from SEJ’s internals, once they get an instantiated engine.

On the topic of Excel formula parsing, I have now gotten the grammar right, I think. I split the lexer into two versions, one for A1-style cell references (for Excel’s .xls format as returned by JExcelAPI), and another for R1C1-style references (for Excel’s .xml format). This allows me to much better recognize cell references versus names. And it cleared up the parsing code quite a bit.

Robert told me in more detail about his problems with SEJ. I have written a very long reply, which I will use as the basis for a better exposition of SEJ’s design decisions. In summary, I believe he is right that I should not have dived headlong into Java 5 for not very strong reasons. Apart from that, though, I still believe SEJ’s design to be very sound.

April 4, 2006

Got feedback from Robert Zachajewicz of together.at on SEJ yesterday. He said that for his needs, SEJ was too strongly tied to particular Java versions and could not be isolated enough from the rest of his application for his liking. I have asked him to elaborate, since I do not yet see the full merit of his point.

Nevertheless, it got me thinking about this myself (it’s always good to get criticism you can take constructively). I realized I can drop the requirement that the ouptut type be a descendant class of Engine.Computation, and, in fact, a class at all. It can be an interface too. I can then drop the Engine.Computation class entirely. This further reduces the noise introduced by SEJ into your own types.

One aspect where he is right is in the planned design of the system tests. The byte code produced from SEJ’s generated Java source by the Java compiler is compared to the byte code SEJ produces directly (using ASM). This is, admittedly, highly version specific. It is, however, only an issue for running SEJ’s own system tests, not for using it. I intend to do it do ensure that the two generators (source code and byte code) in SEJ actually produce the same output. This should raise the confidence that the source output can be used to gain insight by users of SEJ when something in a byte code computation does not seem to work.

While I knew I had tested SEJ against the JRE 1.4 before, I started reading about -target jsr14 some more. And indeed this option is not officially supported by Sun. So I now use Retrotranslator to generate 1.4 compatible .jars.

March 30, 2006

Cells, Ranges and Names

Excel ranges and names are tricky. Here are some examples:

=SUM(2:5): sums the entire rows 2 through 5.
=SUM(B:D): sums the entire columns B through D.
=SUM(namedrange): sums the named range.
=namedrange: is an error, unless the range is unidimensional and the referencing cell is in a parallel range to the named one. However, in such a cell, =SUM(namedrange) still sums the entire range, not just the unidimensional cut through a possibly multidimensional range.

The last rule deserves an example:

Income  Expense  Profit
  $100      $80  =Income-Expense => $180
  $200     $160  =Income-Expense => $360

will work if column A is named Income and column B is named Expense. However

IncomeA  IncomeB  Wrong!                  Expected
   $100     $200  =SUM(Incomes) => $600!      $300
   $120     $180  =SUM(Incomes) => $600!      $300

will not work as expected if Incomes is defined as A2:B3. If you simply define column A2:A3 as IncomeA and column B2:B3 as IncomeB and write:

IncomeA  IncomeB  Wrong!                          Expected
   $100     $200  =SUM(IncomeA;IncomeB) => $600!      $300
   $120     $180  =SUM(IncomeA;IncomeB) => $600!      $300

it is still wrong. This, however, works:

IncomeA  IncomeB  Correct                   Expected
   $100     $200  =IncomeA+IncomeB => $300      $300
   $120     $180  =IncomeA+IncomeB => $300      $300

This can become rather pathological because the seemingly scalar functions AND and OR are actually aggregators, whereas NOT is scalar:

One    Two   Surprise!                 Expected
TRUE   TRUE  =AND(One;Two) => FALSE!!  TRUE
FALSE  TRUE  =AND(One;Two) => FALSE    FALSE

What’s worse, there is, as far as I know, no scalar equivalent for AND and OR, as + is for SUM.

All this leads to some complicated rules for cell and range parsing in Excel expressions. While researching this, I came up with the following web links:

Here’s a few points worth noting:

For some other reason Excel uses a space (or multiple spaces) as the range intersection operator. While not as ambiguous as the comma, it does require some consideration.

Don’t forget that Excel still allows functions to be preceded with an @.

Don’t forget that array constants (surrounded by braces {}) can contain rows, which are delimited with semicolons ;.

The text either side of a colon are not always cell references. Sometimes they are numbers (eg. $25:26).

A plus is not always a plus, sometimes it�s a unary operator, sometimes a binary operator, sometimes the significant figure in scientific notation. eg. 12E+20.

My current attempt at a BNF grammar looks like what follows. Note the possible ambiguity where name occurs in multiple places.

expr ::=
    cell
  | expr + expr
  ...
  | SUM( ranges ).

ranges ::=
    range {"," range}.

range ::=
    name
  | coords ":" coords
  | col ":" col
  | row ":" row.

cell ::=
    name
  | coords.

coords ::=
  | col row
  | "R" {index} "C" {index}.

index ::=
    <integer>
  | "[" <integer> "]".

col ::= ident.
row ::= <integer>.
name ::= ident.

Compiling `IF`

It took me a while to figure out how to compile IF statements the way the Java compiler does. While I haven’t gotten around to implementing it yet, I believe the trick is to have two modes for compilation. I call them branch-false and branch-true. In branch-false, the test branches to a supplied label if the condition is false, otherwise falls through. branch-true reverses this. The reversal is used when compiling OR in branch-false (and, conversely, AND in branch-true).

branch-false does:

void compileOr( Node a, Node b, Label branch ) {
  Label otherTest = newLabel();
  BRANCH_TRUE.compileTest( a, otherTest );
  mark( otherTest );
  compileTest( b, branch );
}
void compileAnd( Node a, Node b, Label branch ) {
  compileTest( a, branch );
  compileTest( b, branch );
}

branch-true does:

void compileOr( Node a, Node b, Label branch ) {
  compileTest( a, branch );
  compileTest( b, branch );
}
void compileAnd( Node a, Node b, Label branch ) {
  Label notMet = newLabel();
  BRANCH_FALSE.compileTest( a, notMet );
  compileTest( b, branch );
  mark( notMet );
}

The initial call is:

void compileIf( Node test, Node iftrue, Node iffalse ) {
  Label notMet = newLabel();
  Label done = newLabel();
  BRANCH_FALSE.compileTest( test, notMet );
  compileExpr( iftrue );
  compileGoto( done );
  mark( notMet );
  compileExpr( iffalse );
  mark( done );
}

March 29, 2006

When you use an input method that may throw a checked exception, you really have to declare that exception on each and every output method. This is because you cannot know all the places where the author of the spreadsheet is going to use that input. Therefore, SEJ should really check that all output methods declare the union of all the declared exceptions of all the input methods. Since this affects the API, I should implement this check early on.

To support internal caching of multiply referenced values, a computation should support a reset() method, which you have to call prior to reusing a computation with modified inputs.

To minimize the dependencies of compiled engines, I should move the saveTo() functionality from the EngineFactory to the Compiler. That way, compiled engines don’t even need interface to the compiler.

I just manually tested release 0.3.1. Had to tweak it a bit so all of the documentation gets properly included. The runtime-only .jar also did not work at all. Shall release this in 0.3.2. Lesson learned: always install and test a release.

Next Steps

Implement full boolean expression support for IF, including AND and OR.

Implement support for BigDecimal. This might be possible without full type inference by simply substituting BigDecimal for double everywhere.

Write an outline of the type inference strategy so implications on the API can be caught.

Decide on how serialized engines will be stored for both maximum load performance and maximum robustness when SEJ evolves.

Suggestions

Do a simple writer that generates an Excel worksheet based on an SEJ model. This is for applications that create the SEJ models themselves given legacy data or formulas defined in a custom UI. They can then write out a template spreadsheet for when users want to switch to spreadsheet mode.

March 21, 2006

Everything is up and running again except for:

empty cells
strings
INDEX and MATCH
subsections (bands)
binding outputs with parameters (needed for interactive demo)

March 18, 2006

Just finished reading Better, faster, lighter Java. It made me realize I have to think a bit more about the exceptions I expose. Here are my rules:

Anything that is a violation of the API contract should not be a declared exception. This is not a situation you want to catch, it is one you want to avoid.

All internal exceptions should be converted to SEJ-specific exceptions (with cause set appropriately) so client code is not affected when loader, compiler, or engine implementations are swapped.

Separate phases. Model errors, compiler errors, and computation errors should not be mixed up.

throws

I also did some experiments about what happens when input methods throw declared exceptions. It is as I suspected: the processing of throws is purely a compiler thing. The VM does not enforce adherence. So if you have an input method like this:

  public String readFile( String _name ) throws IOException  ...

and use it in a computation that you bind to an output method like this:

  public abstract String getData();

the output method generated by SEJ will be able to actually throw the undeclared IOException!

Since the list of declared thrown exceptions is available through reflection, I think SEJ should, in the future, propagate them through computations and check that bound output methods conform to them. This will be mandatory for the source generator anyway, as it needs to properly place throws clauses on the generated methods.

Usability

I shall have to put a fa�ade on the current compiler interface to simplify engine definition. In particular, this fa�ade will handle section scopes and the lookup of both cell names and the methods on the supplied input/output types.

I shall probably also add convenience classes that fully automate engine definition through reflection and cell names in Excel.

March 16, 2006

It’s done. I’ve rewritten the API and ported to old byte-code compiler over to the new design.

I think I shall use version numbers in the class name for the runtime support for the engine. That way, I can easily support multiple versions of stored engines. As long as the implemented interfaces on the engines don’t change, that is.

Bill Venners writes: Although you can grant special access privileges between types belonging to the same package by giving members protected or package access, this special access is granted to members of the same package at runtime only if they were loaded by the same class loader.

This means one has to declare all interfaces to the engine (inputs and outputs) public, even though they very clearly should often be at most package visible.

Typing

Right now, the engine supports only double-valued computations. It already extends this to Dates by simply converting between doubles and dates in the input interface. The output interface does not handle this yet. For the near future, booleans will be handled as doubles too. Excel internally uses 0 for false, 1 for true (and, yes, you can add them).

Which leaves me with the need for data type analysis only for strings and integers. Strings are essential in the long run. Integers (and longs) would allow much faster pure integer computations.

March 10, 2006

I have decided to drop support for the interpreted engine. Still need to verify this with Claudio, though. It will make the API much simpler, in particular the provision of default implementations for output methods not bound to the sheet. And it will let me concentrate my efforts on the one implementation that will deliver the best performance anyway.

The reason is that the interpreted engine cannot support the construction of a computation that descends from a user-supplied base class. If this is possible, however, the specification of the output interface, including default behaviour, becomes very simple. The user simply writes a base class that SEJ should descend the generated computation from. This class can either be abstract, partially abstract, or fully implemented with default behaviour for all getters.

What’s more, since a generated byte-code class will be very much self-contained and rely only on a few interfaces to support classes, the long-term compatibility of the generated engines will be much improved.

March 8, 2006

Started tutorial. Realized that the old API can be replaced by the new, interface-based one (which is the more common case anyway).

March 7, 2006

When you have an input interface that throws an exception, Java’s reflection mechanism converts that to an InvocationTargetException. In order to be able to catch this exception when accessing an output cell, however, you have to declare InvocationTargetException on the output interface. If you don’t, Java will throw an UndeclaredThrowableException.

I have rethought the interface based API. There is an incomplete prototype in the scratchpad right now. I still need to flesh out the sample implementation both for a generic and a byte-code implementation.

Labels

I discovered that Excel can accept cell labels in formulas. This means you can actually do something like:

   A       B
1  f(10)   100
2  x(20)   200
3  Result  =f(10) + x(20)

Not bad, eh? I don’t know, however, whether JExcelAPI handles this properly.

March 1, 2006

The next steps I take with SEJ should help to finalize the API. They are:

Introduce the interface based binding. Make it so that for native code compilers, this can be as efficient as possible, obviating the need for all by-name lookups. Interface-based binding should support parametrized inputs, for example YA_1000 → getYA( 1000 ). Support multiple input interfaces.

Introduce API for names in model so compiler definition code can react to definitions.

Clean up handling of bands. Rethink terminology (again).

Write an outline of the type inference strategy so implications on the API can be caught.

Write a tutorial for both by-name an by-interface binding, including bands and values provided by callbacks.

Decide on how serialized engines will be stored for both maximum load performance and maximum robustness when SEJ evolves.

Bands

I think bands should not be automatically extended across the entire width or height of the spreadsheet. This would allow one to build spreadsheets like:

  Location     In       Out
  One          SUM(In)  SUM(Out)    =B-C
               $100     $50
               $200     $80
  Another      SUM(In)  SUM(Out)    =B-C
               $100     $200

One would define a top-level range A2:D5 and two sub-ranges B3:B4 and C3:C4. This means that the number of Ins may differ from the number of Outs.

Also, the API Compiler.defineBand() is not precise. A band, in its accustomed meaning in report definition tools is the template for a single instance. SEJ’s API, however, wants to to pass the range encompassing all of the sample instances in the sheet. This is also unfortunate because it makes SEJ assume that a single instances is always just one row high or one column wide. This may not always be convenient for the user.

SEJ does not handle sums over nested bands properly. The user must model intermediate sums for every non-leaf band. I shall accept this as a known limitation for the moment and have documented it.