Systematic Community

kolson · ‎03-17-2022

If you need to cache a large amount of data for subsequent retrieval in a recipe, how would you do it? Would you put it in a lookup table (limit 100K rows) or in a list of hashes?

Is there a limit on the number of rows in a list?

I am wondering how to cache a million or more records.

kolson · ‎03-17-2022

That is a good suggestion, Steven, for some use cases. However, I am referring to cases where the entire dataset is being used as a whole - like for joins or comparisons with another related dataset all at once.

So it is more of an ETL type process vs a streaming or record-by-record process.

sanjay-rathod · ‎03-17-2022

1. If you would retain data across different Jobs.. you need to use external storage(e.g. AWS S3 / MySql etc.)

2. In case of cache and use large data within a single Job executinon...

- You can use Collection (First, prepare the list of raw data and then create the collection from that list). Collection is faster to retrieve the records and easy to understand. I have used this approach for the 100K rows.

Yes, it's limit of 50k records in single query... but by using LIMIT Offset we could get next bunch of records easily.

kolson · ‎03-17-2022

Those are some very interesting options, Sanjay - thanks.

I am thinking more of having a data set where you have random access to any record or set of records in the entire set, so the batch method would not apply here, although in some cases it would work really well.

The AWS or third party storage is an interesting idea, for sure.

gwilkinson · ‎03-22-2022

Workato is working on a Cloud DB persistence layer that may be useful for this use case. Not sure of the details when they will roll this new feature out nor the details of how it will work and any limitations. I would think it certainly could handle millions of records. Perhaps someone from the Workato Product team can comment (Konstantin perhaps).

patrick-steil · ‎03-16-2023

I know this is very late to this discussion, but just for future people looking for such a solution. It seems to me that this is something huge that Workato is missing. The ability to save off "state" of one kind of another in some type of cache and then re-use that again with some type of timeout.

So if it cannot be done in Workato, it certainly could be done in a number of different ways via current cloud technologies:

Basically use something like ElasticCache, Redis, AWS DynamoDB, MongoDB to package up anything you want to cache and then save it in the cache. Then the next time you run the recipe, check the cache to see if that data is populated (and not expired). This could save a lot of steps in the workflow, but of course would require at least one new step to check the cache (and to save off your data in the cache).

If anyone is interested in such a system I would love to architect an API that would make this super easy for any Workato recipe. It could even incorporate encryption so that no one could see your data in the cache, or of course be setup specifically for your company to be segregated properly.

Would be a great AWS project so that it is super fast, secure and scalable.

Patrick

Systematic Community

Cache millions of records