<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cache millions of records in Workato Pros Discussion Board</title>
    <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2575#M1157</link>
    <description>&lt;P&gt;Those are some very interesting options, Sanjay - thanks.&lt;/P&gt;&lt;P&gt;I am thinking more of having a data set where you have random access to any record or set of records in the entire set, so the batch method would not apply here, although in some cases it would work really well.&lt;/P&gt;&lt;P&gt;The AWS or third party storage is an interesting idea, for sure.&lt;/P&gt;</description>
    <pubDate>Fri, 18 Mar 2022 02:59:04 GMT</pubDate>
    <dc:creator>kolson</dc:creator>
    <dc:date>2022-03-18T02:59:04Z</dc:date>
    <item>
      <title>Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2568#M1150</link>
      <description>&lt;P&gt;If you need to cache a large amount of data for subsequent retrieval in a recipe, how would you do it?  Would you put it in a lookup table (limit 100K rows) or in a list of hashes?   &lt;/P&gt;&lt;P&gt;Is there a limit on the number of rows in a list?  &lt;/P&gt;&lt;P&gt;I am wondering how to cache a million or more records.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Mar 2022 22:23:34 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2568#M1150</guid>
      <dc:creator>kolson</dc:creator>
      <dc:date>2022-03-17T22:23:34Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2569#M1151</link>
      <description>&lt;P&gt;I see that this is not possible in Workato.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Mar 2022 22:51:27 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2569#M1151</guid>
      <dc:creator>kolson</dc:creator>
      <dc:date>2022-03-17T22:51:27Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2570#M1152</link>
      <description>&lt;DIV dir="ltr"&gt;I'm not sure what the row limit is on collections, but you might want to look at those,&amp;nbsp;they also allow you to use SQL to query against them and have more columns than lookup&amp;nbsp;tables.&lt;BR /&gt;They do not persist outside of the running recipe though, whereas lookup tables do.&lt;BR /&gt;&lt;BR /&gt;I know this isn't helpful but I'd probably just look at redesigning something so you don't have to cache 1M+ rows in your middleware.&lt;BR /&gt;&lt;BR /&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;SPAN&gt;--&lt;/SPAN&gt;</description>
      <pubDate>Thu, 17 Mar 2022 22:56:58 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2570#M1152</guid>
      <dc:creator>anthony-oconnor</dc:creator>
      <dc:date>2022-03-17T22:56:58Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2571#M1153</link>
      <description>&lt;P&gt;Agreed, it is not optimal, but some use cases require it.&lt;/P&gt;&lt;P&gt;There are cases where you might need a large dataset of records, say from Workday, to compare against what is currently in a legacy DB, and repeated calls across the network to the database are slow; or there may be a requirement to snapshot a continually changing source to compare against another large dataset. So it may not be the common case, but you might need an ETL solution requiring temporary storage of a large dataset. It is not *always* indicative of bad design.&lt;/P&gt;&lt;P&gt;The limit on the size of a collection is 50K records, BTW.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Mar 2022 23:37:33 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2571#M1153</guid>
      <dc:creator>kolson</dc:creator>
      <dc:date>2022-03-17T23:37:33Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2572#M1154</link>
      <description>&lt;P&gt;How about storing these records on one or more Pub/Sub topics?&lt;/P&gt;&lt;P&gt;Then have your secondary recipe(s) read from those to process them.&lt;/P&gt;&lt;BR /&gt;&lt;P&gt;Alongside that, we also use a SQL server to deal with these larger quantities.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Mar 2022 23:43:46 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2572#M1154</guid>
      <dc:creator>steven-marissen</dc:creator>
      <dc:date>2022-03-17T23:43:46Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2573#M1155</link>
      <description>&lt;P&gt;That is a good suggestion, Steven, for some use cases.   However, I am referring to cases where the entire dataset is being used as a whole - like for joins or comparisons with another related dataset all at once.&lt;/P&gt;&lt;P&gt;So it is more of an ETL type process vs a streaming or record-by-record process.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Mar 2022 23:46:51 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2573#M1155</guid>
      <dc:creator>kolson</dc:creator>
      <dc:date>2022-03-17T23:46:51Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2574#M1156</link>
      <description>&lt;P&gt;1. If you need to retain data across different jobs, you need to use external storage (e.g. AWS S3, MySQL, etc.).&lt;/P&gt;&lt;P&gt;2. To cache and use large data within a single job execution:&lt;/P&gt;&lt;P&gt;- You can use a collection (first prepare the list of raw data, then create the collection from that list). Collections are fast to retrieve records from and easy to understand. I have used this approach for 100K rows.&lt;/P&gt;&lt;P&gt;Yes, there is a limit of 50K records in a single query, but by using LIMIT and OFFSET you can easily fetch the next batch of records.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Mar 2022 01:25:03 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2574#M1156</guid>
      <dc:creator>sanjay-rathod</dc:creator>
      <dc:date>2022-03-18T01:25:03Z</dc:date>
    </item>
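    <!-- The LIMIT/OFFSET paging pattern described in the post above can be sketched as follows. This is a minimal illustration in plain Python, not Workato recipe code: query_collection is a hypothetical stand-in for a "Query collection" action running a SQL statement like SELECT * FROM collection LIMIT :limit OFFSET :offset, and the 50K batch size reflects the per-query limit mentioned in the thread.

```python
# Sketch of paging through a collection in fixed-size batches.
# A single query returns at most BATCH_SIZE rows, so keep advancing
# the offset until a page comes back empty.

BATCH_SIZE = 50_000

def query_collection(rows, limit, offset):
    """Hypothetical stand-in for a SQL LIMIT/OFFSET query against a collection."""
    return rows[offset:offset + limit]

def fetch_all(rows):
    out = []
    offset = 0
    while True:
        page = query_collection(rows, BATCH_SIZE, offset)
        if not page:
            break
        out.extend(page)
        offset += BATCH_SIZE
    return out
```

    As the later replies note, this works well for sequential batch processing but not for random access into the full dataset. -->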
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2575#M1157</link>
      <description>&lt;P&gt;Those are some very interesting options, Sanjay - thanks.&lt;/P&gt;&lt;P&gt;I am thinking more of having a data set where you have random access to any record or set of records in the entire set, so the batch method would not apply here, although in some cases it would work really well.&lt;/P&gt;&lt;P&gt;The AWS or third party storage is an interesting idea, for sure.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Mar 2022 02:59:04 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2575#M1157</guid>
      <dc:creator>kolson</dc:creator>
      <dc:date>2022-03-18T02:59:04Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2576#M1158</link>
      <description>&lt;P&gt;Workato is working on a Cloud DB persistence layer that may be useful for this use case. I'm not sure when they will roll this new feature out, how it will work, or what its limitations will be, but I would think it certainly could handle millions of records. Perhaps someone from the Workato Product team can comment (Konstantin perhaps).&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 19:23:32 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/2576#M1158</guid>
      <dc:creator>gwilkinson</dc:creator>
      <dc:date>2022-03-22T19:23:32Z</dc:date>
    </item>
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/4074#M1876</link>
      <description>&lt;P&gt;I know this is very late to this discussion, but this is for future people looking for such a solution. It seems to me that this is something huge that Workato is missing: the ability to save off "state" of one kind or another in some type of cache and then reuse it, with some type of timeout.&lt;/P&gt;&lt;P&gt;If it cannot be done in Workato, it certainly could be done in a number of different ways with current cloud technologies:&lt;/P&gt;&lt;P&gt;Basically, use something like ElastiCache, Redis, AWS DynamoDB, or MongoDB to package up anything you want to cache and then save it in the cache. The next time you run the recipe, check the cache to see if that data is populated (and not expired). This could save a lot of steps in the workflow, though of course it would require at least one new step to check the cache (and one to save your data into the cache).&lt;/P&gt;&lt;P&gt;If anyone is interested in such a system, I would love to architect an API that would make this super easy for any Workato recipe. It could even incorporate encryption so that no one could see your data in the cache, or of course be set up specifically for your company so it is segregated properly.&lt;/P&gt;&lt;P&gt;It would be a great AWS project so that it is super fast, secure, and scalable.&lt;/P&gt;&lt;P&gt;Patrick&lt;/P&gt;</description>
      <pubDate>Thu, 16 Mar 2023 19:12:46 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/4074#M1876</guid>
      <dc:creator>patrick-steil</dc:creator>
      <dc:date>2023-03-16T19:12:46Z</dc:date>
    </item>
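    <!-- The cache-with-timeout pattern described in the post above can be sketched as follows. This is a minimal in-process illustration, not the proposed API: a plain dict stands in for an external store such as Redis or DynamoDB, and the key names and TTL value are illustrative assumptions.

```python
import time

# Minimal sketch of "check the cache, use it if populated and not expired".
# Each entry stores its expiry timestamp alongside the value; an expired
# entry is treated as a miss and evicted on read.

class TtlCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() >= expires_at:  # expired: treat as a cache miss
            del self.store[key]
            return None
        return value

    def put(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)
```

    In a real deployment the store would live outside the recipe run (e.g. Redis with native key expiry), so state survives between jobs, which is the point of the pattern. -->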
    <item>
      <title>Re: Cache millions of records</title>
      <link>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/4091#M1884</link>
      <description>&lt;P&gt;Hey &lt;a href="https://systematic.workato.com/t5/user/viewprofilepage/user-id/5314"&gt;@patrick-steil&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;We at Workato have recently released two features that look like a good fit for this specific ETL use case.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;First, Workato FileStorage is a persistent storage system within Workato that can be used to store large-volume data (a limitless number of rows, on the order of millions), and this data can be fetched and used across recipes.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Second, the new SQL Transformations utility connector can operate on CSV data stored within FileStorage and perform transformations (comparisons with other data sets, live application data extracts, or other CSV files), etc.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;By using both these features, you can store any volume of data as CSV within Workato and run queries on it. More details on the two features are available here:&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.workato.com/features/workato-files.html#workato-filestorage" target="_blank" rel="noopener noreferrer"&gt;FileStorage&lt;/A&gt;&lt;SPAN&gt;,&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.workato.com/features/sql-transformations.html#sql-transformations" target="_blank" rel="noopener noreferrer"&gt;SQL Transformations&lt;/A&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;These features are premium and in beta, so reach out to Workato's customer support for more details.&lt;/P&gt;&lt;P&gt;Cheers!&lt;BR /&gt;Meghan&lt;/P&gt;</description>
      <pubDate>Tue, 21 Mar 2023 19:13:30 GMT</pubDate>
      <guid>https://systematic.workato.com/t5/workato-pros-discussion-board/cache-millions-of-records/m-p/4091#M1884</guid>
      <dc:creator>meghan-legaspi</dc:creator>
      <dc:date>2023-03-21T19:13:30Z</dc:date>
    </item>
  </channel>
</rss>

