SEJ Developer’s Journal – Old Entries
These are the oldest entries of my journal regarding the development of AFC. The newer entries are on the main journal page. Note that the former name of AFC was SEJ (Spreadsheet Engine for Java).
May 26, 2006
I am working on an introductory example for SEJ for the tutorial (because I need this to streamline the docs and tests for binding and support for multiple data types in the interface). This prompted me to resume work on the simple cell binder. What this thing really needs to do is the following:
- Scan the output interface. For every method returning a type SEJ can handle, try to find a correspondingly named cell (GetXY or just XY). If found, bind it. If not, check if the method is abstract. If so, raise an error.
- Scan the remaining unbound cell names. For every name, check the input interface for a corresponding method. If found, bind the cell to it. If not, raise an error. This can be restricted to scan only a subset of names following a given name pattern, for example, all starting with I_.
For both inputs and outputs, I could extend this so parametrized methods can be bound. For example: I_TURNOVER_2 would be bound to getTurnover( int yearsBack )
.
May 24, 2006
Automatic Conversion of Numeric Types
I discussed with Marcel and Igor that SEJ should allow different numeric input and output types and automatically convert them to the internal numeric type used in computations. This would make input and output interfaces reusable for computations with different internal numeric types, as well as being more convenient. For example:
public interface Input {
double getRate();
int getNumberOfItems();
long getPrice();
}
should be usable with both double
, BigDecimal
and scaled long
engines. This is not so easy:
- You must be aware that for scaled
long
andBigDecimal
, thedouble
returned bygetRate()
might be truncated.
- The
long
returned bygetPrice()
is hard to interpret. In a scaledlong
engine, it should be treated as an already scaled long. Otherwise, there is no way an application can pass properly scaled longs into such an engine. For other engines, however, thelong
would more naturally be interpreted as a true integer type, especially since there is currently no way to tell them the implicit scaling to assume.
With Java 5, we might try to circumvent the latter problem using annotations. But a goal of SEJ is JRE 1.4 compatibility. With annotations, we might have something like:
public interface Input {
double getRate();
int getNumberOfItems();
@ScaledLong(4) long getPrice();
}
For the moment, I shall therefore support only the following automatic input conversions:
byte
, treated as a true integer valueint
, treated as a true integer valuedouble
, truncated to fit the internal typeBigDecimal
, truncated to fit the internal type
Overflows are raised as errors.
On the output side, I shall support:
byte
, truncates the internal valueint
, truncates the internal valuedouble
, possibly loses precision vis-a-vis the internal valueBigDecimal
, always exact
Again, overflows are raised as errors.
May 23, 2006
Had a meeting with Marcel, Igor, Markus and Dani of Abacus today. Lots of new todos. Nevertheless, SEJ is now ready for integration into Abacus Lohn. A conversation with Markus revealed that the currently supported range of Excel functionality will go a very long way.
The .ser file format should really be a .jar. Add an option for whether you want it compressed or not. It contains the generated .class files, and a .xml file representing the internal format of SEJ engines. This makes analyzing .ser files possible using standard tools.
Checked out goal seeking in Excel (as prompted to by Markus). This is purely a menu option, so adding this to SEJ is not really realistic. Maybe as a layer around SEJ. But then, applications can currently do that themselves.
May 18, 2006
Test First
I have implemented caching of values. This has been an elevating experience of test-first and documentation-first development.
- I first wrote some notes here in the developer’s journal.
- Then I started the new topic in the tutorial and began documenting the feature, drawing on the notes I had first developed here.
- Interleaved with the documentation I also wrote the new test cases which I immediately cited again into the documentation.
This took about 3 hours. In the end, I had finished documentation and use-case tests before having written a single line of production code.
- I then implemented the feature in about 1:45 h. When the use-case tests ran green, I extended my large spreadsheet-driven test suite to test both caching and non-caching versions. It ran green immediately.
So cool!
Constant Sharing
I started looking at sharing of constant values for expensive types (BigDecimal
, for instance). First, I checked that ASM properly combines equal values in the constant pool – it does.
Then I checked the speed of BigDecimal
construction using three different approaches:
new BigDecimal( String )
BigDecimal.valueOf( long, scale )
- preconstructed in a
private static final BigDecimal
The preconstructed case was, of course, fastest. The interesting part is that – for constants whose digits fit into a long
, which should be most of them – the difference to BigDecimal.valueOf
is not that big. For smaller test runs, it amounts to about 16%. For larger runs it gets closer to 30%. Construction using strings (which SEJ usess right now) is by far the slowest, being about 4 times slower than valueOf( long )
.
Aside: Java 6 is already measurably faster than Java 5 on this test.
I have thus changed SEJ so it generates preallocated BigDecimal
constants using valueOf( long )
wherever possible. On the JRE 1.4 I had, unfortunately, had to resort to new BigDecimal( String )
for values that already were bigdecimals, because the JRE 1.4 does not support BigDecimal.precision()
.
May 17, 2006
Correction: BigDecimal
was ok in 0.4.0, after all. The compiler inserted the rescaling directly into the code instead of handing it off to the runtime.
Scaled long is up and running now, passing all tests. Instead of making the runtime an instance, I had to add support for adding a last argument with runtime context information to the static runtime methods. This is because otherwise I would have had to know that I would need the runtime instance later on before compiling an expression’s arguments (because the JVM expects the instance for a virtual call on the stack first, not last). That would have greatly complicated the compiler’s design. The context argument I can conveniently push last, where required.
Release 0.4.1 is out.
May 16, 2006
Scaled long
support is coming along nicely now. The one big change that’s still missing is to provide the engines with an instance of the runtime. Before, it was sufficient to give them static access. The reason is the runtime must know about things like the scale and rounding mode and it is far simpler to embed this in the runtime once and for all instead of passing it around all the time.
This made me realize that the current BigDecimal
operations defined in the runtime do not enforce the fixed scale. I shall fix this for 0.4.1.
May 11, 2006
Since I continually broke the runtime jar, I have now included dedicated runtime tests in the automated build. This also tests engine serialization and deserialization. Yes!
May 9, 2006
Scaled BigDecimal
support is finally up and running with all tests green, both on JRE 1.4 and 1.5. Quite a refactoring session that was. I shall soon release it as version 0.4.0.
I had to extend Retrotranslator slightly to support some JRE 1.5 additions to BigDecimal
. Retrotranslator’s great design made this very simple.
April 29, 2006
I am working on the BigDecimal constant folder now. It irks me that I have to duplicate a lot of code in the interpreter and the byte code compiler. So I did some tests. Up to Java 5, code such as
double result = getA1() + getA2() * getA3();
is more than twice as fast as
double result = Runtime.opPlus( getA1(), Runtime.opTimes( getA2(), getA3() ) );
Starting with the Java 6 beta, however, both perform identically. This indicates (as another test already did), that the JVM from Java 6 inlines much more aggressively.
This indicates that in the longer run, I can get away with a scheme where the byte code compiler uses a fairly straightforward scheme of compiling expressions to static support methods in a runtime class, which are also used by the interpreter, in an equally generic fashion. What this adds, of course, is a much increased dependency of the generated engines on the runtime class.
April 13, 2006
Just realized that if alternative numeric types are to be supported properly, all the constant folding performed by SEJ will have to be carried out using the alternative types, too. So, for the moment, I shall simply disable constant folding for BigDecimal
.
April 12, 2006
Type Inference
I’ve started working on the type annotation algorithm. A key problem seems to be when there is a difference between the expected type (what the outside computation wants) and the inner type (what the value is). Consider, for instance:
BigDecimal getResult1() { return getA().multiply( BigDecimal.TEN ); }
int getResult2() { return getA() + 4; }
What should getA()
return? The obvious answer is, of course, the most precise of all the expected types. So it’s BigDecimal
here.
Speed Test
Consider, however, the following timings for the repeated execution of the formula x = p + p * f
where p = 123.45
and f = 0.076
(yes, I know this can be made more efficient). I coded this formula using double
, BigDecimal
and both an int
and a long
scaled by 10’000 (the latter is thus equivalent to the Currency
type found in COM and Delphi).
double
: 150 msint
: 200 mslong
: 550 msBigDecimal
: 5500 ms
Times vary, but the relations are fairly stable. So double
is the fastest on my machine (Intel Centrino Core Duo), closely followed by the scaled int
. Both are probably not suitable for financial computations, however. Of the remaining two, the scaled long
still beats BigDecimal
by a factor of 8 to 10. And it might be enough for many financial applications. Particularly so if you can control the scaling factor.
So choosing a faster type can make a huge difference. Can SEJ do this?
Multiple Versions
If getA()
is an input cell, we might simply generate n instances of getA()
, one for each desired type, with appropriate conversions:
BigDecimal getA_Big() { return BigDecimal.valueOf( this.inputs.getA() ); }
int getA_Int() { return this.inputs.getA(); }
where
interface Inputs {
int getA();
}
Let’s now consider that getA()
is not an input. It then is an intermediate result (that is, a non-input cell that is referenced by multiple other cells). Do we generate multiple instances of the subexpression, one for each desired type? This would be:
BigDecimal getA_Big() { return getB()_Big.add( BigDecimal.valueOf( 20 )); }
int getA_Int() { return getB()_Int + 20; }
assuming getB()
is an input like getA()
was above. This seems worthwile because we can now compute int getResult2()
at full int
speed.
Caching
What if, however, getA()
were an expensive subexpression? Like a sum over a large dynamic section? Would it not be better to compute it once and cache the result? Like:
BigDecimal computeA() { return /*compute the sum*/; }
BigDecimal getA_Big() {
if (!isCached_A) {
cache_A = computeA();
isCached_A = true;
}
return cache_A;
}
int getA_Int() { return getA_Big().intValue(); }
Overflows
What if all the summed cells where int
themselves? Should we not sum them as int
s then? What if the sum overflows?
Decision
SEJ must make decisions. The question is: Can you affect them and, if so, how?
In view of the overflow problem, I have decided that SEJ will not try to be clever about inferring fast types. Every simple addition of input values already forces escalation to a bigger and slower type, so without hints from outside, SEJ would have to infer slow types for nearly everything very quickly. Who could give the hints? The programmers cannot, because they do not know the computations performed by the sheet. So it would have to be the sheet designers. I cannot imagine them caring about and being able to specify overflow conditions.
What the programmers can tell SEJ is the general class of computation they are dealing with. So I will let them specify the type being used for all numeric computations by a particular engine. The choices will probably be:
double
long
, with fixed, definable scaleint
, with fixed, definable scaleBigDecimal
, with optional minimum scale
The responsibility for this choice, and for communicating its consequences to the sheet designers, thus rests fully with the programmers. But it does allow them to generate engines suited for precise financial or very fast pure integer computations.
April 11, 2006
Release 0.3.2 is out the door. Now I can turn to supporting BigDecimal
.
I realized that SEJ should ensure that all abstract methods on the output type are bound when generating an engine.
April 7, 2006
Release 0.3.2 is nearing completion. I just got all the tests running again and can do a complete build, including a version of SEJ for Java 1.4 which also passes all tests. Yay! Here’s the news:
sej-runtime.jar
actually works now- More robust formula parsing
- Better formula error reporting
IF
fully supported, includingAND
andOR
- Java 1.4 compatibility
- Dropped
Engine.Computation
Before I release, I shall have to test it manually again, and write a dedicated page about current limitations. Missing right now are, in particular:
NOT
– this will be easy, I think: just call the inverse test compiler- Range intersection
- Vectors
- Excel XML
- OOCalc
- Sections
By the way, the IF
logic described below was not fully correct. But I was on the right track. It works like a charm now.
April 6, 2006
I’ll have to try the following:
public abstract class Output {
public abstract Output newComputation( Input input );
// ...
}
and then
Output engine = (Output) compiler.compileNewEngine();
Output output = engine.newComputation( input );
But it is hacky! I should introduce a factory object here. Making this factory customizable would allow users to write APIs that completely decouple them from SEJ’s internals, once they get an instantiated engine.
On the topic of Excel formula parsing, I have now gotten the grammar right, I think. I split the lexer into two versions, one for A1-style cell references (for Excel’s .xls format as returned by JExcelAPI), and another for R1C1-style references (for Excel’s .xml format). This allows me to much better recognize cell references versus names. And it cleared up the parsing code quite a bit.
Robert told me in more detail about his problems with SEJ. I have written a very long reply, which I will use as the basis for a better exposition of SEJ’s design decisions. In summary, I believe he is right that I should not have dived headlong into Java 5 for not very strong reasons. Apart from that, though, I still believe SEJ’s design to be very sound.
April 4, 2006
Got feedback from Robert Zachajewicz of together.at on SEJ yesterday. He said that for his needs, SEJ was too strongly tied to particular Java versions and could not be isolated enough from the rest of his application for his liking. I have asked him to elaborate, since I do not yet see the full merit of his point.
Nevertheless, it got me thinking about this myself (it’s always good to get criticism you can take constructively). I realized I can drop the requirement that the ouptut type be a descendant class of Engine.Computation
, and, in fact, a class at all. It can be an interface too. I can then drop the Engine.Computation
class entirely. This further reduces the noise introduced by SEJ into your own types.
One aspect where he is right is in the planned design of the system tests. The byte code produced from SEJ’s generated Java source by the Java compiler is compared to the byte code SEJ produces directly (using ASM). This is, admittedly, highly version specific. It is, however, only an issue for running SEJ’s own system tests, not for using it. I intend to do it do ensure that the two generators (source code and byte code) in SEJ actually produce the same output. This should raise the confidence that the source output can be used to gain insight by users of SEJ when something in a byte code computation does not seem to work.
While I knew I had tested SEJ against the JRE 1.4 before, I started reading about -target jsr14
some more. And indeed this option is not officially supported by Sun. So I now use Retrotranslator to generate 1.4 compatible .jars.
March 30, 2006
Cells, Ranges and Names
Excel ranges and names are tricky. Here are some examples:
=SUM(2:5)
- sums the entire rows 2 through 5.
=SUM(B:D)
- sums the entire columns B through D.
=SUM(namedrange)
- sums the named range.
=namedrange
- is an error, unless the range is unidimensional and the referencing cell is in a parallel range to the named one. However, in such a cell,
=SUM(namedrange)
still sums the entire range, not just the unidimensional cut through a possibly multidimensional range.
The last rule deserves an example:
Income Expense Profit
$100 $80 =Income-Expense => $180
$200 $160 =Income-Expense => $360
will work if column A is named Income and column B is named Expense. However
IncomeA IncomeB Wrong! Expected
$100 $200 =SUM(Incomes) => $600! $300
$120 $180 =SUM(Incomes) => $600! $300
will not work as expected if Incomes is defined as A2:B3. If you simply define column A2:A3 as IncomeA and column B2:B3 as IncomeB and write:
IncomeA IncomeB Wrong! Expected
$100 $200 =SUM(IncomeA;IncomeB) => $600! $300
$120 $180 =SUM(IncomeA;IncomeB) => $600! $300
it is still wrong. This, however, works:
IncomeA IncomeB Correct Expected
$100 $200 =IncomeA+IncomeB => $300 $300
$120 $180 =IncomeA+IncomeB => $300 $300
This can become rather pathological because the seemingly scalar functions AND
and OR
are actually aggregators, whereas NOT
is scalar:
One Two Surprise! Expected
TRUE TRUE =AND(One;Two) => FALSE!! TRUE
FALSE TRUE =AND(One;Two) => FALSE FALSE
What’s worse, there is, as far as I know, no scalar equivalent for AND
and OR
, as +
is for SUM
.
All this leads to some complicated rules for cell and range parsing in Excel expressions. While researching this, I came up with the following web links:
Here’s a few points worth noting:
- For some other reason Excel uses a space (or multiple spaces) as the range intersection operator. While not as ambiguous as the comma, it does require some consideration.
- Don’t forget that Excel still allows functions to be preceded with an @.
- Don’t forget that array constants (surrounded by braces
{}
) can contain rows, which are delimited with semicolons;
.
- The text either side of a colon are not always cell references. Sometimes they are numbers (eg.
$25:26
).
- A plus is not always a plus, sometimes it�s a unary operator, sometimes a binary operator, sometimes the significant figure in scientific notation. eg.
12E+20
.
My current attempt at a BNF grammar looks like what follows. Note the possible ambiguity where name
occurs in multiple places.
expr ::=
cell
| expr + expr
...
| SUM( ranges ).
ranges ::=
range {"," range}.
range ::=
name
| coords ":" coords
| col ":" col
| row ":" row.
cell ::=
name
| coords.
coords ::=
| col row
| "R" {index} "C" {index}.
index ::=
<integer>
| "[" <integer> "]".
col ::= ident.
row ::= <integer>.
name ::= ident.
Compiling IF
It took me a while to figure out how to compile IF
statements the way the Java compiler does. While I haven’t gotten around to implementing it yet, I believe the trick is to have two modes for compilation. I call them branch-false and branch-true. In branch-false, the test branches to a supplied label if the condition is false, otherwise falls through. branch-true reverses this. The reversal is used when compiling OR
in branch-false (and, conversely, AND
in branch-true).
branch-false does:
void compileOr( Node a, Node b, Label branch ) {
Label otherTest = newLabel();
BRANCH_TRUE.compileTest( a, otherTest );
mark( otherTest );
compileTest( b, branch );
}
void compileAnd( Node a, Node b, Label branch ) {
compileTest( a, branch );
compileTest( b, branch );
}
branch-true does:
void compileOr( Node a, Node b, Label branch ) {
compileTest( a, branch );
compileTest( b, branch );
}
void compileAnd( Node a, Node b, Label branch ) {
Label notMet = newLabel();
BRANCH_FALSE.compileTest( a, notMet );
compileTest( b, branch );
mark( notMet );
}
The initial call is:
void compileIf( Node test, Node iftrue, Node iffalse ) {
Label notMet = newLabel();
Label done = newLabel();
BRANCH_FALSE.compileTest( test, notMet );
compileExpr( iftrue );
compileGoto( done );
mark( notMet );
compileExpr( iffalse );
mark( done );
}
March 29, 2006
When you use an input method that may throw a checked exception, you really have to declare that exception on each and every output method. This is because you cannot know all the places where the author of the spreadsheet is going to use that input. Therefore, SEJ should really check that all output methods declare the union of all the declared exceptions of all the input methods. Since this affects the API, I should implement this check early on.
To support internal caching of multiply referenced values, a computation should support a reset()
method, which you have to call prior to reusing a computation with modified inputs.
To minimize the dependencies of compiled engines, I should move the saveTo()
functionality from the EngineFactory
to the Compiler
. That way, compiled engines don’t even need interface to the compiler.
I just manually tested release 0.3.1. Had to tweak it a bit so all of the documentation gets properly included. The runtime-only .jar also did not work at all. Shall release this in 0.3.2. Lesson learned: always install and test a release.
Next Steps
- Implement full boolean expression support for
IF
, includingAND
andOR
.
- Implement support for
BigDecimal
. This might be possible without full type inference by simply substitutingBigDecimal
fordouble
everywhere.
- Write an outline of the type inference strategy so implications on the API can be caught.
- Decide on how serialized engines will be stored for both maximum load performance and maximum robustness when SEJ evolves.
Suggestions
- Do a simple writer that generates an Excel worksheet based on an SEJ model. This is for applications that create the SEJ models themselves given legacy data or formulas defined in a custom UI. They can then write out a template spreadsheet for when users want to switch to spreadsheet mode.
March 21, 2006
Everything is up and running again except for:
- empty cells
- strings
- INDEX and MATCH
- subsections (bands)
- binding outputs with parameters (needed for interactive demo)
March 18, 2006
Just finished reading Better, faster, lighter Java. It made me realize I have to think a bit more about the exceptions I expose. Here are my rules:
- Anything that is a violation of the API contract should not be a declared exception. This is not a situation you want to catch, it is one you want to avoid.
- All internal exceptions should be converted to SEJ-specific exceptions (with
cause
set appropriately) so client code is not affected when loader, compiler, or engine implementations are swapped.
- Separate phases. Model errors, compiler errors, and computation errors should not be mixed up.
throws
I also did some experiments about what happens when input methods throw declared exceptions. It is as I suspected: the processing of throws
is purely a compiler thing. The VM does not enforce adherence. So if you have an input method like this:
public String readFile( String _name ) throws IOException ...
and use it in a computation that you bind to an output method like this:
public abstract String getData();
the output method generated by SEJ will be able to actually throw the undeclared IOException
!
Since the list of declared thrown exceptions is available through reflection, I think SEJ should, in the future, propagate them through computations and check that bound output methods conform to them. This will be mandatory for the source generator anyway, as it needs to properly place throws
clauses on the generated methods.
Usability
I shall have to put a fa�ade on the current compiler interface to simplify engine definition. In particular, this fa�ade will handle section scopes and the lookup of both cell names and the methods on the supplied input/output types.
I shall probably also add convenience classes that fully automate engine definition through reflection and cell names in Excel.
March 16, 2006
It’s done. I’ve rewritten the API and ported to old byte-code compiler over to the new design.
I think I shall use version numbers in the class name for the runtime support for the engine. That way, I can easily support multiple versions of stored engines. As long as the implemented interfaces on the engines don’t change, that is.
- Bill Venners writes
- Although you can grant special access privileges between types belonging to the same package by giving members protected or package access, this special access is granted to members of the same package at runtime only if they were loaded by the same class loader.
This means one has to declare all interfaces to the engine (inputs and outputs) public, even though they very clearly should often be at most package visible.
Typing
Right now, the engine supports only double
-valued computations. It already extends this to Date
s by simply converting between doubles and dates in the input interface. The output interface does not handle this yet. For the near future, booleans will be handled as doubles too. Excel internally uses 0 for false, 1 for true (and, yes, you can add them).
Which leaves me with the need for data type analysis only for strings and integers. Strings are essential in the long run. Integers (and longs) would allow much faster pure integer computations.
March 10, 2006
I have decided to drop support for the interpreted engine. Still need to verify this with Claudio, though. It will make the API much simpler, in particular the provision of default implementations for output methods not bound to the sheet. And it will let me concentrate my efforts on the one implementation that will deliver the best performance anyway.
The reason is that the interpreted engine cannot support the construction of a computation that descends from a user-supplied base class. If this is possible, however, the specification of the output interface, including default behaviour, becomes very simple. The user simply writes a base class that SEJ should descend the generated computation from. This class can either be abstract, partially abstract, or fully implemented with default behaviour for all getters.
What’s more, since a generated byte-code class will be very much self-contained and rely only on a few interfaces to support classes, the long-term compatibility of the generated engines will be much improved.
March 8, 2006
Started tutorial. Realized that the old API can be replaced by the new, interface-based one (which is the more common case anyway).
March 7, 2006
When you have an input interface that throws an exception, Java’s reflection mechanism converts that to an InvocationTargetException
. In order to be able to catch this exception when accessing an output cell, however, you have to declare InvocationTargetException
on the output interface. If you don’t, Java will throw an UndeclaredThrowableException
.
I have rethought the interface based API. There is an incomplete prototype in the scratchpad right now. I still need to flesh out the sample implementation both for a generic and a byte-code implementation.
Labels
I discovered that Excel can accept cell labels in formulas. This means you can actually do something like:
A B
1 f(10) 100
2 x(20) 200
3 Result =f(10) + x(20)
Not bad, eh? I don’t know, however, whether JExcelAPI handles this properly.
March 1, 2006
The next steps I take with SEJ should help to finalize the API. They are:
- Introduce the
interface
based binding. Make it so that for native code compilers, this can be as efficient as possible, obviating the need for all by-name lookups. Interface-based binding should support parametrized inputs, for example YA_1000 → getYA( 1000 ). Support multiple input interfaces.
- Introduce API for names in model so compiler definition code can react to definitions.
- Clean up handling of bands. Rethink terminology (again).
- Write an outline of the type inference strategy so implications on the API can be caught.
- Write a tutorial for both by-name an by-interface binding, including bands and values provided by callbacks.
- Decide on how serialized engines will be stored for both maximum load performance and maximum robustness when SEJ evolves.
Bands
I think bands should not be automatically extended across the entire width or height of the spreadsheet. This would allow one to build spreadsheets like:
Location In Out
One SUM(In) SUM(Out) =B-C
$100 $50
$200 $80
Another SUM(In) SUM(Out) =B-C
$100 $200
One would define a top-level range A2:D5 and two sub-ranges B3:B4 and C3:C4. This means that the number of Ins may differ from the number of Outs.
Also, the API Compiler.defineBand()
is not precise. A band, in its accustomed meaning in report definition tools is the template for a single instance. SEJ’s API, however, wants to to pass the range encompassing all of the sample instances in the sheet. This is also unfortunate because it makes SEJ assume that a single instances is always just one row high or one column wide. This may not always be convenient for the user.
SEJ does not handle sums over nested bands properly. The user must model intermediate sums for every non-leaf band. I shall accept this as a known limitation for the moment and have documented it.