Hi all,
TNG50-1-Dark's Sublink tree seems to be stored in a single 331 GB file. To the best of my knowledge, this is the only Sublink tree where this happened. It makes analysis on the file a little non-trivial (basically, it forces us to do partial reads, plus some extra bookkeeping when those reads split branches in half; nothing too scary).
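For concreteness, the bookkeeping I have in mind looks roughly like this: split a tree's contiguous row range into bounded slabs so no single read blows the memory budget. This is a minimal sketch; the row offsets and slab size below are made up for illustration, not the real Sublink layout.

```python
# Sketch of chunked partial reads from one large tree file.
# The numbers are hypothetical; Sublink stores each tree as a
# contiguous block of rows per dataset, so a tree is described
# by a (start, length) row range within the file.

def slab_ranges(start, length, max_rows):
    """Split the row range [start, start+length) into slabs of at
    most max_rows rows each, so every read stays within budget."""
    ranges = []
    lo = start
    while lo < start + length:
        hi = min(lo + max_rows, start + length)
        ranges.append((lo, hi))
        lo = hi
    return ranges

# e.g. a tree occupying rows 1000..3500, read in 1024-row slabs:
print(slab_ranges(1000, 2500, 1024))
# → [(1000, 2024), (2024, 3048), (3048, 3500)]
```

Each `(lo, hi)` pair would then become one slice read against the HDF5 datasets; a branch that crosses a slab boundary is the case needing the extra stitching logic.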
I was curious why this happened for this particular simulation. It sounds like the trees are split into chunks automatically by the code, so I was wondering whether something unusual in the tree structure prevented that (e.g. there was no legal place to split the tree).
Best regards,
-Phil
Dylan Nelson
21 Apr '21
Hi Phil,
Nothing special in particular; likely just a heuristic for when to split that didn't quite work. I think ultimately we are moving towards no file chunks, as there really isn't any reason for them. This is particularly true with analysis running on the server, i.e. in the Lab, but also because any tree (or subset) can be extracted from a single file with the same performance as if it were split into 100 smaller files.
I suppose the old reason was to make it easier to download over the web, but in theory one can send a resume/partial request for just part of a file, so it shouldn't matter much here either.
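A partial request of that kind is just a standard HTTP `Range` header. As a minimal sketch (the URL is a placeholder, and this assumes the data server honors range requests with a 206 Partial Content response):

```python
# Sketch of an HTTP byte-range request for part of a large file.
# The URL is a placeholder, not a real endpoint.
import urllib.request

url = "https://www.example.org/sublink_tree.hdf5"  # hypothetical
start, end = 0, 1024 * 1024 - 1                    # first 1 MiB

req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
# A compliant server replies "206 Partial Content" with only those
# bytes, which is also how download resumption works:
# data = urllib.request.urlopen(req).read()
print(req.get_header("Range"))
# → bytes=0-1048575
```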