The Search for Fast Aggregates - Implementation & Results

By Daniel Wood, 13 August 2011

In the final part of this two-part series, we go into a little more detail about the aggregate technique we developed in the first article. We also carry out a number of comparative aggregate tests and present the results, and we address some of the limitations of this technique along with ways you could work around them:

Link to Part One - Trial & Error

A very quick overview of where we are at

  • Summary fields & aggregate calcs require all record data to be sent from host to client
  • This makes aggregates on large datasets or large records slow
  • We found that an "All Values" value list based on a single field does not send all record data from host to client. Instead, it sends just that field's value from all records (perhaps using the field's index).
  • Using this idea, we build a value list of the values we want to aggregate; we then retrieve the value list contents and apply an aggregate to those values, bypassing the need to download all the records.
  • These value lists combine duplicate values from records into a single entry. For this reason our value list had to be based on a special calculation whose contents were unique for every record, contained the value we wanted to aggregate, and made it easy to retrieve the original value when needed.

If you haven't yet read part one then it is probably a really good idea to give it a look before reading this one - it will make this article make a lot more sense!

Retrieving the Value List contents

So at this stage we have built our special calculation, and built a value list based on it. The end result is a value list with one entry for every record in the table. Here is a snippet of what the value list looks like:

107.6JDBD
107.7BFHA
107.7DDGB
107.7FGIC
107.7HCFB
107.7HEFE
107.7HHEI
107.7IFCD
107.8BAGB
107.8BICE
107.8CJAF

You can see the numeric component is the weight field value we wish to aggregate eventually. The alpha part is a conversion of the primary key value from a number into an alphabetical string.
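
To give a rough idea, a stored calculation along these lines could look like the sketch below. This is not the exact calculation from the example file; the Weight and Contact ID field names are taken from the filtering example later in this article, and the 0-to-A through 9-to-J digit mapping is an assumption that happens to match the sample values above.

// Sketch only - a stored calculation field with a text result.
// It appends an alphabetic encoding of the primary key to the weight, so every
// entry in the value list is unique and the numeric part is easy to recover.
Weight &
Substitute ( GetAsText ( Contact ID ) ;
    [ "0" ; "A" ] ; [ "1" ; "B" ] ; [ "2" ; "C" ] ; [ "3" ; "D" ] ; [ "4" ; "E" ] ;
    [ "5" ; "F" ] ; [ "6" ; "G" ] ; [ "7" ; "H" ] ; [ "8" ; "I" ] ; [ "9" ; "J" ]
)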

The value list contents is obtained using the ValueListItems function:

ValueListItems ( Get ( FileName ) ; "Aggregate_Weight All" )

With the contents of the value list obtained, we can use the Filter function to strip out the alpha-representation of the primary key:

Filter ( theValueList ; "0123456789-.¶")

You will note we have also kept the other characters that can occur in a number: the negative sign and the decimal point. We also retain the return delimiters to keep the values as a list. The end result is:

107.6
107.7
107.7
107.7
107.7
107.7
107.7
107.7
107.8
107.8
107.8

Hey, this is looking promising! We now have every Weight value in a list, and we obtained it by retrieving just that value, not the entire record - we are on the right track.

So how do we perform an aggregate?

We looked at a couple of ways to turn this list into an aggregate result, and decided that a custom function would be a great way to achieve this.

Our initial attempt focused simply on a sum aggregate. We began by running a substitute on the list, replacing all return characters with plus signs:

Substitute ( theList ; "¶" ; "+" )

Taking the snippet of the value list above, this gave us:

107.6+107.7+107.7+107.7+107.7+107.7+107.7+107.7+107.8+107.8+107.8

Which is now a simple FileMaker expression. We then ran this result through the Evaluate function. This function takes a text string as a parameter, and evaluates it as if it were a FileMaker expression:

Let ( [
theList = ValueListItems ( Get ( FileName ) ; "Aggregate_Weight All" ) ;
theList = Substitute ( theList ; "¶" ; "+" )
];
Evaluate ( theList )
)

End result - well, it "kind of" worked. With small lists it worked great, but once the number of items being aggregated grew large enough, we were presented with a "?" result.

The reason for this is that FileMaker has a limit on the number of characters permitted in any calculation expression. According to the technical specifications on the FileMaker website, this limit is 30,000 characters.

To get around this issue, we built a recursive custom function that performed the above operation on the list, but in blocks of 500 values at a time. So the first 500 values were summed, and the result of that was added to the sum of the second 500, the third 500, and so on until each block of values had been summed. This worked great, and was pretty quick too. For example if you were summing 10,000 values, the custom function would perform 20 recursions, each summing 500 values.
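
As an illustration, a recursive custom function along those lines could look something like the sketch below. This is not the original function from the example file - the ValueList_Sum name and its single parameter are assumed - but it follows the same block-of-500 approach.

// Sketch only - hypothetical custom function: ValueList_Sum ( theList )
// Sums the first 500 values with Evaluate, then recurses on the remainder.
Case ( IsEmpty ( theList ) ; 0 ;
    Let ( [
        block = LeftValues ( theList ; 500 ) ;                                         // first block of values
        rest = RightValues ( theList ; ValueCount ( theList ) - ValueCount ( block ) ) ;
        expression = Substitute ( Left ( block ; Length ( block ) - 1 ) ; "¶" ; "+" )  // e.g. "107.6+107.7+..."
    ] ;
        Evaluate ( expression ) + ValueList_Sum ( rest )
    )
)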

That's great for sum, but what about min/max/average?

Because we were using the Evaluate function, we realised that we could adapt our custom function to take a second parameter, that being the name of the aggregate function we wished to apply.

[Screenshot: the original aggregate custom function]

This is the original version of the custom function. The overall list is broken into chunks, with 1,000 values aggregated in any given recursion of the function.

The first few variable declarations in the function grab the first 1,000 values, chop off the trailing return character, and replace the remaining return delimiters with semi-colons (the character used as the parameter separator in aggregate functions). At this point we end up with something like:

107.6;107.7;107.7;107.7;107.7;107.7;107.7;107.7;107.8;107.8;107.8

The AggregateFunction parameter is the name of an aggregate function, which is then placed before the above string. For example if the parameter was sum, we build the following string:

Sum ( ... the above string... )

Run that sucker through the Evaluate function, and you have your aggregate! The handling of blocks of values using recursion keeps the evaluation expression under the necessary character limit.
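
Here is a rough sketch of how a function like the one shown above could be written. The ValueList_Aggregate name and its parameters are assumptions rather than the author's exact code, but the sketch follows the steps just described: take a block of 1,000 values, swap the return delimiters for semicolons, wrap the block in the named aggregate function, Evaluate it, and recurse to combine blocks.

// Sketch only - hypothetical custom function: ValueList_Aggregate ( theList ; AggregateFunction )
// AggregateFunction is the name of a native function such as "Sum", "Min" or "Max".
Case ( IsEmpty ( theList ) ; "" ;
    Let ( [
        block = LeftValues ( theList ; 1000 ) ;                               // grab up to 1000 values
        block = Left ( block ; Length ( block ) - 1 ) ;                       // chop the trailing return
        rest = RightValues ( theList ; ValueCount ( theList ) - ValueCount ( block ) ) ;
        args = Substitute ( block ; "¶" ; " ; " ) ;                           // "107.6 ; 107.7 ; ..."
        blockResult = Evaluate ( AggregateFunction & " ( " & args & " )" )    // e.g. Sum ( 107.6 ; 107.7 ; ... )
    ] ;
        If ( IsEmpty ( rest ) ;
            blockResult ;
            // combine this block's result with the result for the remaining blocks
            Evaluate ( AggregateFunction & " ( " & blockResult & " ; " &
                       ValueList_Aggregate ( rest ; AggregateFunction ) & " )" )
        )
    )
)

Combining block results in this way is exactly why Sum, Min and Max behave themselves while Average does not, as the next section explains.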

The "Average" aggregate function, a curly one!

Sum, Min and Max work great with this custom function. The reason is that if you add values, you get one total result. Similarly there is only ever 1 minimum and 1 maximum value in an overall list. It doesn't matter if you add the list up in smaller blocks, or compare blocks of values with other blocks to obtain a min or max value, the outcome is always the same.

Average is different however. To determine average aggregate, the total number of values has to be known, and the Average function determines this based on the number of parameters it is passed.

Consider this example, a list of the following values:

1
2
3
4
5
Average ( 1 ; 2 ; 3 ; 4 ; 5 )

The result of this calculation is 3. The Average function first sums the values, and then divides the total by the number of values passed to obtain the average. In this case, that is 15/5 = 3.

Now consider our implementation of the above custom function using blocks of values from the overall list. What happens if we split the list in two:

1
2
3

and

4
5

Using the custom function, we end up with the average of the first list being 6/3 = 2. The average of the second list is 9/2 = 4.5. The function would then combine the results into another aggregate calculation:

Average ( 2 ; 4.5 )

This evaluates to 6.5 / 2 = 3.25, which is certainly NOT the expected result of 3 we were after.

The simple fact is that the average has to be calculated in a different way: we first need to sum the contents of the entire list, and then divide that total by the number of values in the original list.

Rather than try to make the current custom function any more complicated, we created a new custom function called ValueList_Average that runs the other aggregate custom function as a Sum, then divides the result by the number of values in the list:

[Screenshot: the ValueList_Average custom function]
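
As a rough sketch (again assuming the generic function above is called ValueList_Aggregate), the average wrapper could be as simple as:

// Sketch only - hypothetical custom function: ValueList_Average ( theList )
Let ( [
    total = ValueList_Aggregate ( theList ; "Sum" ) ;   // sum of every value in the list
    countOfValues = ValueCount ( theList )              // number of values in the original list
] ;
    If ( countOfValues > 0 ; total / countOfValues ; "" )   // a single division over the whole list
)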

Enough technical mumbo jumbo, time for some results

For testing, we built an example file (see bottom of this article). The test file carries out three types of aggregate tests:

  • Summary fields
  • Aggregate Functions
  • Value List technique with our Custom Aggregate Function

For each type of aggregate, we test the sum, average, min and max aggregates. The testing is done on a Contacts table with 10,000 records. There are a total of 32 stored fields on the table, and each record averages about 0.6kb in size for a total table size of 5.9 megabytes.

Test One: File opened locally on the client machine

For the first test, we are eliminating the network as a factor. This will hopefully give us insight into the performance of the functions used, and of the value list retrieval and processing.

[Screenshots: local-file test results for summary fields, aggregate functions, and the value list technique]

As you can see, the results of the three tests are very similar. Over multiple run-throughs, various aggregates occasionally reach the 1 second mark but come in under a second on subsequent runs, fairly randomly. This suggests summary fields and aggregate functions are basically identical in speed. The slightly higher frequency of 1 second results for the value list implementation suggests it is basically the same speed, but maybe fractionally slower.

Speeds for all three tests are very fast with 10,000 records when the database is local because no records have to be sent over the network, so any overhead comes from the aggregate methods themselves - and there is very little of it.

Test Two: File opened remotely over wide area network - 10k records

Here is the real acid test. The file has been hosted on a remote server and is being accessed over a fairly average New Zealand internet connection. In order to give a fair test for each method, the file was closed and reopened after each test. The reason for this is to clear the client's cache of records downloaded from the host.

[Screenshots: WAN test results (10,000 records) for summary fields, aggregate functions, and the value list technique]

So, what can we tell from these results? The first thing to note is that the tests above were carried out in the order shown. This gives a clear indication that the big performance hit is taken the very first time an aggregate is evaluated, as that is when all of the records are downloaded from host to client. As you can see, the summary field and aggregate function tests take pretty much the same time - the time needed to download all 10,000 records.

The value list implementation, on the other hand, was over 11 times faster than the other two aggregate methods. The performance gain from aggregating via a value list is huge. The main reason is that far less traffic is sent between host and client. We showed in the local file test that the overhead of actually processing the values is negligible, so by reducing the major bottleneck - traffic between host and client - we have greatly improved the speed of the aggregate calculations!

Note that if the tests were run a second or third time directly after the first test, they were almost instant. The reason is that in the first two cases the client had downloaded and cached all the records, and in the third the value list had been cached, so subsequent tests were essentially working with local data.

Changing weight values on the records prior to running a new test resulted in the summary field and aggregate function tests taking roughly the same amount of time as before. The value list option took slightly longer in this case because, since the value list contents had changed, the whole value list is resent from host to client - however, given it is pretty damn quick, the performance impact is minimal.

Test Three: File opened remotely over wide area network - 100k records

For the last test we are going all out and bumping up the record count to 100,000. The file is opened under the same conditions as test number two.

[Screenshots: WAN test results (100,000 records) for summary fields, aggregate functions, and the value list technique]

Again, tests 1 and 2 take roughly the same amount of time - this time over 10 minutes. Subsequent aggregates are much faster as they are working on the now-cached records.

But once again we see the value list test has shone through, being over 10 times faster than the regular aggregate methods.

Another thing we did when running this 100k test was to fire up an application called Wireshark. This is a utility which lets you capture network traffic and keep track of packets transmitted and data flowing over the network.

We set up Wireshark to capture only the data sent between the client and the host server, on FileMaker's port 5003. Here is what we found:

[Screenshot: Wireshark capture for the aggregate function test]

For the aggregate function test, total packet traffic between host and client was 84,608 packets over the 10+ minutes, with 72,588,262 bytes transmitted in total - a whopping 69.2 megabytes!

[Screenshot: Wireshark capture for the value list technique]

And here are the results for the value list technique. Note that the total number of packets transmitted was 5,372 - close to 16 times fewer, and fewer packets means less overall latency. The total data transmitted was 4,430,483 bytes, which equates to only 4.2 megabytes - 16.5x less data!

Wow, so it's faster, that's awesome... and here's the bad news...

To this point we've been demonstrating a pretty ideal example: we have aggregated every record in the table. But seriously, how often would you really want to do that? There are some situations where you might, for example a dedicated reporting table where records are generated purely for reporting purposes.

More often than not, however, you want to aggregate a subset of a table's records. Aggregate functions work best when aggregating related records through a relationship, which is generally filtered down to a subset of the table's records. Summary fields likewise excel in sub-summary reports, where they tally grouped subsets of records.

Unfortunately this value list technique does have limitations on how far you can take it.

The value list has to be stored.

This is perhaps the biggest limiting factor. Because the field used in the value list has to be stored, you cannot introduce complex filtering into your calculation based on globals, or fields from other tables.

Let's consider an example where you have a couple of global fields in which you enter a date range. You want to sum all records whose date field falls within your chosen range. How would you achieve this?

Your first instinct would be to change the value list calculation to reflect this filtering. One cool and useful property of the value list is that if a record contains an empty value for the field the value list is based on, that record is essentially omitted from the value list. This has no performance impact on our method, which is great - if anything it makes the method faster, as a smaller set of values in the value list means faster processing.

Back to the problem though. Our date range is specified in global fields, so we cannot use those in the calculation, which means we can't set the calculation result to blank if the record falls outside the chosen date range. Bummer!

Having said that, some filtering is possible...

Hard-coded filtering is possible though and works quite well. Consider a contacts table with 10,000 records. There is a status field that contains either "Active" or "Deleted". Further to that is a Country field.

Let's say we want to find the average weight of all contacts living in America whose status is Active. To do this we simply modify our value list calculation like so:

If ( Status = "Active" and Country = "USA" ;
  Weight & numToText ( Contact ID ) ;
  ""
)

Here we have hard-coded a check: if the contact is active and in the USA, the calculation has a result; if not, the result is blank and the record will not appear in the value list. Then we just run through the usual process of obtaining the aggregate from the value list.

This is nice, but very restrictive. For each subset of records you wanted to aggregate in this way, you would require a separate calculation field and value list to do so. Perhaps this might be fine depending on your situation, but it is quite limited.

Found sets and summaries

The other thing this technique is not is a replacement for summary fields in sub-summary reports. Calculations simply can't know about sub-summary parts, and even if they could, the calculation would in all likelihood end up unstored anyway, breaking at the value list stage.

So in conclusion....

We've covered a lot in this article and the previous one. We've shown various aggregate methods, and why some perform slowly over wide area networks. We've walked through the process of coming up with an alternative solution to the performance issue, and arrived at a technique combining a number of FileMaker features including value lists, custom functions, and calculations.

We have also gone into detail about building an aggregate custom function to turn a list of values into the desired result. We then presented test results for each of the three aggregate methods, showing that the value list method has the potential for significant performance gains under very specific conditions.

Finally, we showed that this technique is most certainly not a perfect replacement for good old fashioned summary fields and aggregate functions. It can only be used under certain conditions, but if those match your situation, then you could be in for a good performance gain!

Example File

Please find attached an example file. This file was used for all the screenshots in this article, and is provided to help you fully understand what is going on in this article, and to let you experiment in FileMaker with this solution. In the file you will also find a couple of additional tests briefly touched upon in the article but not fully tested. These are in the area of aggregating a subset of the overall records in a table.

Download Example File With 10,000 Records (3.5mb)

Download Example File With 100,000 Records (36.4mb)

Reader's Note:

Christoph Teschers pointed out in the comments that in the example files the tests do not yield correct results if negative values are used in the field being aggregated. This can be resolved by adding a - symbol to the Filter function, i.e.:

 Filter ( ValueListItems ( Get ( FileName ) ; "Aggregate_Weight All" ) ; "-0123456789.¶" )

Thanks for spotting that one, Christoph!


Comments

  • Nick Lightbody 02/12/2014 12:33am (10 years ago)

    Daniel

    Very interesting and thank you for sharing.

    Where, in response to a comment, you touched on Bruce Robertson's virtual list technique, it took me back 2 or 3 years to when Bruce helped me to develop a UI for mobile devices using virtual lists - but instead of the "traditional" methods of building the virtual list - walking the set or SQL - I chose to build those lists from value lists delivered by FMS12, because I knew that version of server - very new then - handled value lists much more efficiently than previous versions.

    This worked very well - so I would have thought you could use that here to combine your liking for the efficiency of value lists - which I share - with a virtual list?

    Or perhaps there is another reason why not?

    Cheers, Nick

  • Gianluca 25/08/2014 5:16am (10 years ago)

    Ciao Daniel,

    This info helped me a lot with a 300K file hosted on a remote server. I have to do at least 6 different summaries on it, and with the wrong connection it took a long time.

    By the way, I also checked the summary ( List ) function. If you use it, for instance, on the weight field as in your sample, it yields all the data from the field and not only unique values.

    So maybe your technique could be simplified using the summary ( list ) field type?

    I haven't done extensive tests, but it seems that the answer is faster.

  • Charity 25/10/2011 7:33am (13 years ago)

    Daniel said, "It seems crazy to me that if you had a table of 100 fields, but merely wanted to export 1 field, server would send you the whole lot, I guess that's just how it works. "

    Hi Daniel down south!! :)

    Is this added justification then to split tables, putting less accessed fields in a 1 to 1 relationship table? FM should produce a whitepaper showing a flow of data depending upon type between servers and users and how that all works; what downloads, what doesn't so we can quickly make good choices. I wondered about portals too.

    And thank you for sharing your knowledge with us. I am learning things it would normally take 10 years to learn on my own, if ever...

    Char()

  • Fabrice Nordmann 02/09/2011 10:03am (13 years ago)

    Hi Daniel,

    thanks a lot for sharing this.
    Another improvement you could make to your filtering would be to check the decimal separator (quite a few of us use the comma: http://en.wikipedia.org/wiki/File:DecimalSeparator.svg )

    Filter ( * ; "-0123456789¶" & middle ( 3/2 ; 2 ; 1 ))

    But frankly, I found SO interesting your testing of the value lists downloading or not downloading records depending on their configuration. Thanks !

  • Daniel Wood 23/08/2011 4:55pm (13 years ago)

    Hi Christoph, thanks for spotting that, I have added a little appendage to the end of the article so people don't lose your finding in the comments :) Cheers!

  • Christoph Teschers 23/08/2011 3:31pm (13 years ago)

    Hey Daniel - thanks for this genius idea! Most interesting is your way of exploring and bending the boundaries of FM! Awesome!
    I played around a bit with your file and found that negative values do not work (admittedly, not likely when summing up weight, but maybe for keeping track of lollies in the jar...).
    However, this is easy to fix by adding a "-" to your Filter function (and it works for all four functions!):

    theList = Filter ( ValueListItems ( Get ( FileName ) ; "Aggregate_Weight All" ) ; "-0123456789.¶" )

    :)

  • Daniel Wood 21/08/2011 8:59pm (13 years ago)

    Hi Richard. Your idea is a good one, and was one of the options I considered when looking at how to parse the Weight value back out of the value list after the value list items were retrieved. In testing, I found that methods such as the one you mention would require the use of a recursive custom function to traverse the list one item at a time and transform the list by eliminating something - in this case removing parts using LeftWords or RightWords.

    The problem was one of speed - a recursive custom function gets progressively slower as the size of the list increases as each value in the list requires a new call to the custom function. It also means the list is restricted to the call stack size FileMaker imposes on recursive custom function calls. All of this resulted in something that was not ideal. So I looked at ways the same end result could be achieved but without recursive custom functions, such as using native FileMaker functions. This was about the best I could figure out through using the Filter function, the result was it was much faster so I went with it.

    Having said that, a list-traversing recursive custom function may be suitable for smaller record sets :)

  • Richard 19/08/2011 12:34am (13 years ago)

    Hi Daniel - again a great idea - thank you for it. Apart from the technique it got me looking at the list of functions again, as the List function is not one of my regular ones. It is surprising how entrenched we get in our usual functions - especially after 16 years!

    One more point - when you were combining weights with record IDs and converting one into letters for later parsing out - why not create a calculation field that combines the two into one text field, with the two parts separated by a space? Then use the LeftWords and RightWords functions to split them apart later and convert them back into numbers?

    - Richard

  • Mark Richman 18/08/2011 1:52pm (13 years ago)

    This is a great pair of articles Daniel.

    I'm glad to see other developers grappling publicly with the concept of record-level-data vs. field-level-data. I felt after last year's DevCon that this was perhaps the most difficult concept to fully grasp, followed next by the subsequent impact of the found set and the movement of 'unintentional' data.

    I'm also glad it won't be me wracking my brain to figure out all the nuances of how to apply these concepts in the real world, such as you've done here :-)

  • Daniel Wood 18/08/2011 7:44am (13 years ago)

    Hi Richard,
    Unfortunately the referenced file would bring through all the record data when carrying out the aggregate if you were directly referencing it. If it imported the information from the records, say via a server-side script overnight, then maybe that's a different story :) It sounds like you are perhaps thinking about the virtual list technique, which I tried to figure out a way to use in this solution too, but alas it all boils down to what does and doesn't download entire record data in FileMaker. The virtual list technique is awesome, but in this situation it still requires you to access record data, which is downloaded :(

    In regards to stuff that doesn't download record data - it's a very small list! Obviously "All Values" value lists fall outside the realm of downloading all record data as shown. The Get(Foundcount) function doesn't download all record data as the Count function does, which makes it an awesome tool for obtaining record counts both on a found set and via a relationship.

    The one that is piquing my curiosity right now is Get(RecordID). I'm racking my brain for potential applications for it. One avenue I have pursued (with some success) is sorting a portal by a particular field - without technically sorting it. I'll give you a brief idea of how it works below. Actually the sorting application was originally going to be this article, but I changed it at the last second to aggregates as it seemed easier to understand. Anyway, here goes:

    1. Unstored calc in Contacts table containing Get(RecordID)
    2. Stored calc in Contacts table containing Field you want to sort records by & Get(RecordID). Similar to in this article, you convert any digits in sort-field to alpha-characters. Ie calc might be Full Name & Get(recordID). You would convert any digits in Full Name field to letters...
    3. Build all values value list based on the calc in step 2. This is going to be sorted by Full Name
    4. Grab value list items & parse out the full name. You now have a list of Get(RecordID)'s for every record, in order sorted by Full Name
    5. For your portal relationship, use the List function from your base context, and list the Get(RecordID) unstored calculation. Listing this particular field is very fast.
    6. Run the FilterValues function, i.e. FilterValues ( allIDs ; relationshipIDs ). The first part is the IDs from the value list, and the second is the list you just grabbed through the relationship in the previous step. The end result is a list of Get(RecordID)'s through the relationship in sorted order, because we essentially kept the IDs from the value list that were in the relationship.
    7. Another unstored calc on the Contacts table, this one is going to now evaluate the list built in #6. This calc gets tricky and makes use of Get(RecordNumber). For more info on how the calc is kinda setup, check the article about Fast sorting in reverse creation order, it's a similar idea. The gist is the first record in the relationship is set to the Get(RecordID) which is in the first position in the list from #6. Second record grabs second ID, third the third and so on....
    8. One more calc field in Contacts containing Get(RecordID) but this time it is a stored calc. Establish a relationship from the table occurrence the portal is based on, to Contacts. This one is going to match on the calc from step #7, to the Calc created here in step #8.
    9. In the portal, place fields from the new table occurrence created in #8..

    Make sense? It's a bit crazy yes, but in testing it worked. One of the drawbacks to the method is that it has to grab a value list from all items (as this article does), so depending on the # of records in the table it has potential for slowness. The other bottleneck is FilterValues. The more values you are filtering by the slower it becomes. So if you had 1,000 records in the portal, then it can be slow because FilterValues running on 1000 items is quite a costly function.

    However, having said all that, in testing it works. I don't see any harm in using Get(RecordID) as the basis of a relationship for this particular function. All calcs using the function are stored or unstored, and I don't see any way there can be duplicates for the records. Duplicates using Get(RecordID) only happen I think if the value is stored in a field (not a stored calc) and the database is cloned, in that situation there is potential for duplicates. However given it's all calculation based I think it is fine (prove me wrong!)

    Maybe I'll write that up as an article at some point. If you are keen to grab the demo file I had for that let me know.

  • Richard 18/08/2011 6:19am (13 years ago)

    Hi Daniel - As Tim has asked it would be good to know what does and what does not send whole records about the place.

    I was wondering if you created a file with say 100,000 records numbered with serial numbers from one to 100,000. Then in the file you want to get an aggregate from you enter serial numbers in only those that you want to get results from. Then using a relationship you draw through the data onto the stand-alone file and use that data only for an aggregate. Don't tell me - the reference file also brings all the record data to the client computer? - Richard

  • Tim Anderson 18/08/2011 5:57am (13 years ago)

    Great thinking, for a lot of 'simple' solutions your value list method will work fine.

    It would be useful if we could collate, or get hold of if it exists, a list of functions that do not send full records.

    Thanks for your inspiration

    Tim

  • Daniel Wood 17/08/2011 9:30am (13 years ago)

    Hi Richard, in our tests we have found that exporting even a single field either to disk, or to another table (via import) requires the entire contents of the record to be downloaded, so no go there unfortunately, at least in the ways we have tried. It seems crazy to me that if you had a table of 100 fields, but merely wanted to export 1 field, server would send you the whole lot, I guess that's just how it works. Exporting happens client-side, and so server decides to just send you the whole record and then leave it up to the client to do what it needs to do - be it export 1 field or whatever :-)

  • Richard 17/08/2011 9:15am (13 years ago)

    Hi Daniel - a very interesting and informative article. I was wondering if exporting a single field (or the fields required for the report) to a dedicated table would make the aggregate as fast or maybe faster?

  • Kevin Frank 15/08/2011 3:29pm (13 years ago)

    Another amazing article (or in this case, pair of articles), and a very generous gift to the FileMaker community. Thank you Daniel.
