I'm currently building a set of tools to make accessing Illustris data easy and quick with only a laptop and an internet connection. They're mostly built, and can be found (and used!) here: https://github.com/zpenoyre/illustrisAPI
The hope is for them to become a major avenue for people to get at the data, requiring very little understanding of Illustris's slightly byzantine (though only because of the scope of the project) data layout. Certainly all of my work with the simulation could be done via this package (saving hundreds of hours lost fighting with HDF5 cutouts and a million unit conversions), and I believe that will be the case for the vast majority of users.
It's in a completely usable state right now (and I'd love to hear people's experiences with it, so get in touch if you're interested!) but there are still some things that need to be finished, neatened and improved.
The following is copied and pasted from an email correspondence with Dylan Nelson. He suggested I open it up to anyone interested on here. It's a (long, sorry) list of questions and suggestions that would streamline the operation of this package. Not all of them are that feasible, but if you don't ask, you don't get...
Faster requests - This question basically stems from my very patchy understanding of how the internet works, but is there any way to speed up the time each request takes? What is the bottleneck here? (I have a nasty suspicion it may be the speed of light, or alternatively something on the user's side, but I thought it a question worth asking.)
Download speed - There seems to be a maximum download speed (of ~3 Mbps) regardless of my internet connection; is it possible there's a bottleneck on the server side?
Long files - We talked about the download being interrupted for long files (which gives a "Truncated end of file" error when you try to open the result). Is there any way to work around this? Obviously it's a much more complex bit of tech, but a browser seems perfectly happy to pause and resume a download; can the Python requests package do the same?
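Something along these lines is roughly what I have in mind, only a sketch (it assumes the server honours HTTP Range requests, and url/fname/headers are whatever the rest of the package already uses):

import os
import requests

def resume_download(url, fname, headers=None):
    # download url to fname, resuming from a partial file if one already exists
    reqHeaders = dict(headers) if headers else {}
    mode = 'wb'
    if os.path.exists(fname): # partial file on disk, so ask only for the remaining bytes
        reqHeaders['Range'] = 'bytes=%d-' % os.path.getsize(fname)
        mode = 'ab'
    r = requests.get(url, headers=reqHeaders, stream=True)
    r.raise_for_status()
    with open(fname, mode) as f:
        for chunk in r.iter_content(chunk_size=1024*1024): # write 1 MB at a time
            f.write(chunk)
    return fname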
Merger trees - Thanks for implementing the Main Descendant Branch! However, whilst it's a neat fix, it has some real issues, mostly that if the halo does not make it all the way to z=0 the download can become unexpectedly huge.
More generally, while there's obviously a wealth of information in the merger trees, the form and content of the files is a real headache. The structure, whilst clever, is a trial to learn, and there's a huge amount of extra info stored in the form of all the GroupCat content.
I propose reformatting the main meat of the merger tree into three files:
1 - A main branch (per sublink subhalo ID) - (2xNsnaps) a simple list of the main progenitor branch of a halo, with the snapshot number and subfind id stored.
2 - A merger history (per sublink subhalo ID) - For each merging galaxy, lists the snapshot number and subfind id of every progenitor (except the main). I'd also like to add the snapshot number and id at the time of max stellar mass, and of infall into the halo, which is an easy addition.
3 - A lookup table (per snapshot) - Gives the sublink subhalo ID for a given subfind id, allowing you to find the above two files for a given snapshot and subfind id.
This file could be superseded by just adding this value to the data in e.g. "http://www.illustris-project.org/api/Illustris-1/snapshots/133/subhalos/9/" or even just attaching these two files to that subhalo in trees.
I've (mostly) written code that will make these files. If I finish it and send it to you would you be willing to generate these and put them on the API? This would allow people to access the history of one galaxy, and find and follow any merging galaxy, without any extra computation.
A smaller sample - At the moment a lot of the cost of downloading the halo/subhalo catalogs and the merger trees comes from the huge number of tiny and (arguably) slightly boring DM-dominated halos. Could we think of some good criteria to weed out as many of these as we can without getting rid of any data people would find useful?
Obviously both data sets can co-exist, but having a pared down version could hugely shrink the file size that most people need.
One possible criterion could be to split by halos and subhalos with no stars. A separate version of the group catalogs with only the halos/subhalos containing stars would reduce the file size by a large factor (~4 for Illustris-3 at z=0). A more universal divider, perhaps with criteria involving stellar, gas, DM and BH mass, could probably be worked out too.
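For example, this is roughly the kind of cut I have in mind (just a sketch, using the public illustris_python scripts on a group catalog that has already been downloaded; the path is made up):

import numpy as np
import illustris_python as il

basePath = './Illustris-3/output' # wherever the group catalog files live locally
snap = 135

# load only the mass-by-type field, then keep the subhalos with any stellar mass
massType = il.groupcat.loadSubhalos(basePath, snap, fields=['SubhaloMassType'])
withStars = np.where(massType[:, 4] > 0)[0] # particle type 4 = stars
print('keeping %d of %d subhalos' % (withStars.size, massType.shape[0]))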
Other languages - You said at one point that this package can't be of general use if it's only in Python. It feels like it should be possible to write IDL and MATLAB (any others?) wrappers that just run the Python scripts and translate the returned numpy arrays. This is far from my field of expertise, but in your eyes would that suffice? You'd still have to have Python and a few packages installed, but at least you could use it out of the box from a MATLAB/IDL script.
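To give an idea of the Python side of such a wrapper (only a sketch, not something that exists yet): the returned dictionary of numpy arrays could simply be dumped to a .mat file with scipy.io.savemat, which MATLAB can load directly (IDL would need its own reader, or an HDF5 dump instead):

import scipy.io

def save_for_matlab(data, outFile):
    # data: a dictionary of numpy arrays, as returned by the Python tools
    scipy.io.savemat(outFile, data) # then, in MATLAB: load('out.mat')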
Dylan Nelson
4 Sep '17
Hi Zephyr,
Thanks for the thoughts, some comments:
Faster requests/download speed: this just comes down to resources - I designed the Illustris public data release with a budget of $0, meaning I had to make do with everything available. Then, for example, the database is only as fast as it is, and the internet connection likewise. But I don't think this is a real problem, and indeed haven't seen that this is a fundamental limitation of anyone's use of Illustris data. I expect that requests should generally be answerable in 200ms (e.g. 5 per second), and the link speed is likely capable of ~100 MB/s for access to the raw data files. This capacity is of course shared with everyone.
"Long files" (truncated/corrupted HDF5 files for large requests) - this is my primary interest of something to fix, and I will work on it.
Simplified merger trees:
a. the main branch is already in the tree, and accessible, so I'm not sure what exactly you mean? This data is insignificant in size, so there is no harm in having additional fields, which can just be ignored if you only want the snapshot numbers and subfind IDs.
b. merger history: "the snapshot number and subfind id of every progenitor" sounds like the complete tree to me? I'm not sure what you mean beyond this.
For "snapshot number and id at the time of max stellar mass, and of infall into the halo" this is essentially analysis, similar to what has been done for the "Stellar Assembly" supplementary catalog. If you want to compute values like this and make a catalog file, I am happy to add it in that fashion.
c. lookup table: this is already available and used in the public data scripts, we call these "Offsets" - see Subhalo_SublinkSubhaloID in the data documentation.
Smaller sample: I see the point, but interesting objects for you are not interesting objects for everyone, so you're really just talking about making a sub-sample of the group catalog in general. I don't think we need to make and store various subsets of the catalogs, the assumption is that people will choose objects as they like. This can be done by searching in the API/search form, or by downloading the group catalog files and selecting in a script. I think, for your scripts, for sample construction they should simply download the group catalog files as needed and cache them, creating subsets on the fly as requested.
Other languages: this is just so that the data is accessible widely and without any specific dependencies. Since this is already the case and working, I don't think it needs to be re-architected so that other languages can run a Python script and interpret the result - what is the goal here?
Zephyr Penoyre
6 Sep '17
Hi Dylan, thanks for the response. Here are some follow-up thoughts:
Faster requests/download speed: understood, thought this was a long shot
Long files: Good to hear you're on the case. Some of my later comments are (slightly patchy) ways to counteract specifically this, i.e. making files smaller and thus less susceptible to incomplete downloads, but perhaps there are more universal solutions.
Simplified merger tree: This isn't a case of me trying to create data that isn't there (which is obviously impossible anyway!) just trying to put it in a form where it's convenient to access and easy to understand. Don't forget that right now the trees come padded with all the data from the subhalo catalogs as well! Let me reply to each point individually:
a) the main branch: As stated above, I fully acknowledge that this is already in the tree, but at the moment it 1) requires a bit of background understanding of the tree structure to extract and 2) can be found efficiently for galaxies at z=0, and quite efficiently for some galaxies at other redshifts (using your main descendant branch), but for a reasonable fraction of galaxies retrieving this data is a significant download. For example, if I'm looking at a subhalo at z=1 which will end up in the merger tree of subhalo 0 at z=0, there is no way to trace its main progenitor branch (which obviously will not make it all the way to z=0) without downloading the whole merger tree for the most massive subhalo (which is a huge file) and then doing a bit of sifting through to reproduce the mpb.
Instead, I'm proposing a simple file, called something like Tree_10478 (the exact number is arbitrary), which would read [snapshotNumber, subhaloNumber] for every entry in the mpb. I.e. if this subhalo exists between snapshots 80 and 90 and has subhalo number 1000 at snapshot 85 it might look (using some random numbers as an example) like:
80, 998
81, 998
82, 203
83, 205
84, 998
85, 1000
86, 1000
87, 1001
88, 1211
89, 1211
This is obviously a very small and easy-to-read file, and as long as there's one associated with every subhalo it's easy to trace any subhalo without downloading and sorting through the trees.
b) Merger history. Let's only focus on the moment of a merger according to subfind (ignoring the time of maximum stellar mass and infall) and look at what I'm suggesting and why it would be useful.
If every subhalo has its main progenitor branch saved (as detailed in section a), then all we need to link and trace any merging galaxy is the subhalo and snapshot number of the merging galaxy (which can then be used to find the merging galaxy's own mpb and trace that if desired).
Thus for the above halo, if we say it merges with three smaller galaxies, at snapshots 81, 84 and 88 we could have a merger tree, in a file called Mergers_10478 that looks like this (again the structure is [snapshotNumber, subhaloNumberOfMergingGalaxy] and the numbers are made up for demonstration purposes):
81, 1005
84, 8974
88, 132
This is clearly much smaller and easier to parse and understand than the whole tree, but contains all the necessary info to retrieve any information (and more) contained in the merger tree. Again, this is not trying to add any information, just turn the info that is there into the most accessible and manageable form.
c) Lookup table: Sorry, perhaps I should have been clearer: this table contains the indices needed to find the above files. This is just to close the loop so that the full tree for any subhalo, at any snapshot, can be retrieved.
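To make the intended workflow concrete (file names and formats here are entirely hypothetical, just matching the examples above):

import numpy as np

# Mergers_10478: rows of [snapshot, subfind id] for each galaxy merging onto this branch
mergers = np.loadtxt('Mergers_10478', delimiter=',', dtype=int)
snap, subId = mergers[0] # e.g. (81, 1005)

# hypothetical per-snapshot lookup table: row i gives the tree ID for subfind id i
lookup = np.loadtxt('Lookup_%d' % snap, dtype=int)
treeId = lookup[subId]

# so the merging galaxy's own main progenitor branch can be pulled in the same way
mpb = np.loadtxt('Tree_%d' % treeId, delimiter=',', dtype=int)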
A note on the philosophy of this approach: I think maybe we're not seeing eye to eye here because you're imagining people wanting to look at the whole tree for a subhalo (including all the tiny merging subhalos and their histories), whereas I'm imagining people being more interested in either instantaneous mergers (i.e. we care about the properties of all merging galaxies at the time of merger but not their histories) or significant mergers (i.e. we care about the properties and evolution of only a handful of the merging galaxies). Of course, all the data to follow every halo over the whole simulation is there (and still roughly as easy to access as in the original trees), but I'm trying to bring to the forefront a cut-down, more easily understood and used version of the history.
Smaller sample: This was mostly just a suggestion to reduce download sizes. I agree that many people may want to look at many different facets of the data. I'm not for a moment suggesting replacing the old catalogs, just that a more streamlined version might be made and put online alongside the original. But maybe a consensus on what part of the data would be usefully left out cannot be reached.
It's simply a reaction to the fact that most users will necessarily need to download some halo/subhalo catalog properties, and that this is currently the biggest download they're likely to make. It's obviously easy to cut down the catalog once it's been downloaded, but by then the hard work's already been done.
Other languages: This was just me trying to work out how to make these tools useful to the widest audience, but it sounds like you might not think it worth it if there are still python dependencies.
Zephyr Penoyre
6 Sep '17
whoops, in the above examples of what might be in the Tree and Merger file the linebreaks got deleted when I published the comment. Hopefully it's still relatively clear what was meant.
Dylan Nelson
7 Sep '17
I think the most practical solution is: if you have code, or want to write code, to derive these things from the SubLink merger tree data, then if you send it to me I will integrate it into the API. Each should be a function e.g.
def return_interesting_data(subhalo_id, snapshot_number):
    tree1 = illustris_python.sublink.loadTree(subhalo_id, snapshot_number)
    tree2 = illustris_python.sublink.loadTree(subhalo_id_RootDescendant, snapshot_RootDescendant)
    # do stuff and make data, a dictionary of numpy arrays
    return data
I hope it's clear what I mean? tree1 and tree2 are the available data, of course they are the same if snapshot_number=135 (z=0). Otherwise, tree2 is the parent tree which contains tree1.
Zephyr Penoyre
7 Sep '17
Hi Dylan,
Are you suggesting making this a process that runs server-side and computes and returns the data at each call? If so, that sounds like a great idea (provided the server-side operations are quick, but I think opening and analysing a tree should be doable pretty efficiently).
If so, this function should be pretty easy to write. Are there any language issues or dependencies I should be aware of? I'm guessing Python 2 with numpy is fine?
Dylan Nelson
7 Sep '17
Hi, yes, that would be my suggestion.
Zephyr Penoyre
7 Sep '17
Forgive me if I'm being stupid, but I've now hit upon this issue a few times:
If I run loadTree for snapshot_number<135, there seems no easy way to recover the portion of the merger tree at later times.
Obviously the tree contains "RootDescendantID", but this is a subhalo ID (in SubLink); there seems to be no obvious way to translate it into subhalo_id_RootDescendant and snapshot_RootDescendant.
Am I missing something? How are you recovering the main descendant branch in the API?
Dylan Nelson
10 Sep '17
These values such as RootDescendantID are indices into the SubLink tree. For example, if this is 9, then you need to load the 10th entry of e.g. SnapNum and SubfindID. This is, like everything else, a global index. For example, if the tree contains 2 files with 8 entries each, you need to load the 2nd entry of the second file.
From the docs: "The number inside each circle from the figure is the unique ID (within the whole simulation) of the corresponding subhalo, which is assigned in a depth-first fashion. Numbering also indicates the on-disk storage ordering for the SubLink trees. For example, the main progenitor branch (from 5-7 in the example) and the full progenitor tree (from 5-13 in the example) are both contiguous subsets of each merger tree field, whose location and size can be calculated using these links."
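(As a toy numerical illustration of the convention just described, with made-up sizes:)

import numpy as np

# suppose the tree is split over 2 files with 8 entries each
fileOffsets = np.array([0, 8]) # first global index stored in each file
globalIndex = 9 # e.g. a RootDescendantID value

fileNum = np.max(np.where(globalIndex - fileOffsets >= 0)) # -> 1, i.e. the second file
localIndex = globalIndex - fileOffsets[fileNum] # -> 1, i.e. its 2nd entry (0-based)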
Dylan Nelson
10 Sep '17
For example, the current code of the API:
def treeOffsets(sim_name, number, id, type=None):
    """ Handle offset loading for a merger tree cutout. """
    fileBase, gcBase = basePaths(sim_name)

    # load groupcat chunk offsets from header of first file
    with h5py.File(gcPath(gcBase,number,0),'r') as f:
        groupFileOffsets = f['Header'].attrs['FileOffsets_Subhalo']

    # calculate target groups file chunk which contains this id
    groupFileOffsets = int(id) - groupFileOffsets
    fileNum = np.max( np.where(groupFileOffsets >= 0) )
    groupOffset = groupFileOffsets[fileNum]

    with h5py.File(gcPath(gcBase,number,fileNum),'r') as f:
        # load the merger tree offsets of this subgroup
        if type == "sublink":
            RowNum = f["Offsets"]['Subhalo_SublinkRowNum'][groupOffset]
            LastProgID = f["Offsets"]['Subhalo_SublinkLastProgenitorID'][groupOffset]
            SubhaloID = f["Offsets"]['Subhalo_SublinkSubhaloID'][groupOffset]
            return RowNum,LastProgID,SubhaloID
        if type == "lhalotree":
            TreeFile = f["Offsets"]['Subhalo_LHaloTreeFile'][groupOffset]
            TreeIndex = f["Offsets"]['Subhalo_LHaloTreeIndex'][groupOffset]
            TreeNum = f["Offsets"]['Subhalo_LHaloTreeNum'][groupOffset]
            return TreeFile,TreeIndex,TreeNum

def loadSublinkTree(sim_name, number, id, fOut=None, mpbOnly=False, mdbOnly=False):
    """ Load portion of Sublink tree, for a given subhalo, return either flat HDF5 or hierarchical dict. """
    fileBase, gcBase = basePaths(sim_name)

    # the tree is all subhalos between SubhaloID and LastProgenitorID
    RowNum,LastProgID,SubhaloID = treeOffsets(sim_name, number, id, type='sublink')

    if RowNum == -1:
        raise Http404

    rowStart = RowNum
    rowEnd = RowNum + (LastProgID - SubhaloID)

    # load only main progenitor branch? in this case, get MainLeafProgenitorID to optimize load
    if mpbOnly:
        with h5py.File(sublinkPath(fileBase),'r') as fTree:
            MainLeafProgenitorID = fTree['MainLeafProgenitorID'][rowStart]
        # re-calculate tree subset
        rowEnd = RowNum + (MainLeafProgenitorID - SubhaloID)

    # load only main descendant branch (e.g. from z=0 descendant to current subhalo)
    if mdbOnly:
        with h5py.File(sublinkPath(fileBase),'r') as fTree:
            RootDescendantID = fTree['RootDescendantID'][rowStart]
        # re-calculate tree subset
        rowStart = RowNum - (SubhaloID - RootDescendantID)
        rowEnd = RowNum

    nRows = rowEnd - rowStart + 1

    # open single tree file and block load all required fields from the tree
    with h5py.File(sublinkPath(fileBase),'r') as fTree:
        for field in fTree.keys():
            data = fTree[field][rowStart:rowEnd+1]
            # ...
Zephyr Penoyre
10 Sep '17
Hi Dylan,
I guess the point I was trying to make is that it's not currently possible to achieve this with the tools in illustris_python.sublink (as this returns a tree capped at some snapshot number). It seems like you and I both had to work around this by hand in our own code.
Anyway, here's the code to do it. Is the backend of the API as simple as a bunch of Python functions like this? If so that's really cool (and I'd love to have a browse of it sometime).
def returnSimpleTree(subhalo_id,snapshot_number):
    # load sublink chunk offsets from header of first file
    groupFile=basePath+'groups_'+str(snapshot_number)+'/'+'groups_'+str(snapshot_number)+'.0.hdf5'
    with h5py.File(groupFile,'r') as f:
        sublinkFileOffsets = f['Header'].attrs['FileOffsets_SubLink']

    # calculate target sublink file chunk which contains this id
    sublinkFileOffsets = int(subhalo_id) - sublinkFileOffsets
    fileNum = np.max( np.where(sublinkFileOffsets >= 0) )
    treeFile=basePath+'trees/SubLink/tree_extended.'+str(fileNum)+'.hdf5'

    with h5py.File(treeFile,'r') as rawTree:
        nFind=rawTree['SubfindID'][:]
        nSnap=rawTree['SnapNum'][:]
        nSub=rawTree['SubhaloID'][:]
        nFirst=rawTree['FirstProgenitorID'][:]
        nNext=rawTree['NextProgenitorID'][:]
        nDesc=rawTree['DescendantID'][:]

    # initialises the tree
    thisTree=-1*np.ones((136,2),dtype=int)
    thisTree[:,0]=np.arange(135,-1,-1)

    # traces the tree forward to the latest subhalo in the mpb
    zIndex=np.argwhere((nFind==subhalo_id) & (nSnap==snapshot_number))[0][0]
    thisIndex=zIndex
    descSub=nDesc[zIndex]
    thisSub=nSub[zIndex]
    descIndex=zIndex+descSub-nSub[zIndex]
    while ((nFirst[descIndex]==thisSub) & (nDesc[descIndex]!=-1)): # while the first progenitor of each descendant is this subhalo
        descSub=nDesc[descIndex]
        thisSub=nSub[descIndex]
        descIndex=descIndex+descSub-nSub[descIndex]
        thisIndex=descIndex
    thisSnap=nSnap[thisIndex]
    thisFind=nFind[thisIndex]
    thisTree[135-thisSnap,1]=thisFind # records subfind id of first step

    # initialises the list of merging galaxies
    if thisSnap!=135: # if it doesn't reach z=0, records the subhalo it merges with
        descIndex=thisIndex+nDesc[thisIndex]-nSub[thisIndex]
        descSnap=nSnap[descIndex]
        descFind=nFind[descIndex]
        mergerTree=[[descSnap,descFind]]
    else:
        mergerTree=[]

    while nFirst[thisIndex]!=-1: # goes through main progenitors
        thisIndex=thisIndex+nFirst[thisIndex]-nSub[thisIndex] # index of next step along the main progenitor branch
        thisSnap=nSnap[thisIndex] # snapshot of this main progenitor
        thisFind=nFind[thisIndex] # subfind id of this main progenitor
        thisTree[135-thisSnap,1]=thisFind # records subfind id of this main progenitor
        nextIndex=thisIndex+0 # separate index for walking the merging (next progenitor) branches
        while nNext[nextIndex]!=-1: # goes through merging halos (next progenitors)
            nextIndex=nextIndex+nNext[nextIndex]-nSub[nextIndex] # index of next progenitor
            # records details of these mergers
            mergerSnap=nSnap[nextIndex]
            mergerSub=nFind[nextIndex]
            mergerTree.append([mergerSnap,mergerSub])

    # got to the end of this branch, so must save it
    filled=np.argwhere(thisTree[:,1]!=-1) # finds the snapshots during which the halo is in the tree
    if filled.size==1:
        thisTree=thisTree[filled[0],:]
    elif thisTree[0,1]==-1: # if the tree doesn't make it to z=0
        thisTree=thisTree[filled[0][0]:filled[-1][0]+1,:]
    else:
        thisTree=thisTree[0:filled[-1][0]+1,:] # if it does
    mergerTree=np.array(mergerTree)
    data={"Main":thisTree}
    data['Mergers']=mergerTree
    return data
Dylan Nelson
19 Sep '17
Hi Zephyr,
For Illustris-1 snapshot 135 subhalo 12345 I seem to get this error:
IndexError: index 0 is out of bounds for axis 0 with size 0
> /draco/u/dnelson/external.py(35)returnSimpleTree()
33
34 #traces the tree back to the latest subhalo in the mpb
---> 35 zIndex=np.argwhere((nFind==subhalo_id) & (nSnap==snapshot_number))[0][0]
Also, we cannot do this global load of the file:
with h5py.File(treeFile,'r') as rawTree:
    nFind=rawTree['SubfindID'][:]
    nSnap=rawTree['SnapNum'][:]
    nSub=rawTree['SubhaloID'][:]
    nFirst=rawTree['FirstProgenitorID'][:]
    nNext=rawTree['NextProgenitorID'][:]
    nDesc=rawTree['DescendantID'][:]
This is ~2 GB of data read unnecessarily, which will take too long and use too much memory; instead, this should be converted into the appropriate slice.
Zephyr Penoyre
Whoops, I hit exactly the same error (not sure how that slipped by me the first time...)
I've updated it, hopefully fixing the error and also not opening the whole file. You seem to be using a function (sublinkPath()) that I'm not familiar with - do you have all the sublink files rolled into one?
Anyway, let me know if this one has any problems:
def returnSimpleTree(subhalo_id,snapshot_number):
    # load sublink chunk offsets from header of first file
    groupFile=basePath+'groups_'+str(snapshot_number)+'/'+'groups_'+str(snapshot_number)+'.0.hdf5'
    with h5py.File(groupFile,'r') as f:
        subhaloFileOffsets = f['Header'].attrs['FileOffsets_Subhalo']
        treeFileOffsets = f['Header'].attrs['FileOffsets_SubLink']

    # calculate target group catalog file chunk which contains this id
    subhaloFileOffsets = int(subhalo_id) - subhaloFileOffsets
    fileNum = np.max( np.where(subhaloFileOffsets >= 0) )
    subhaloFile=basePath+'groups_'+str(snapshot_number)+'/'+'groups_'+str(snapshot_number)+'.'+str(fileNum)+'.hdf5'
    subhaloOffset=subhaloFileOffsets[fileNum]

    # finding the right file for this tree and where exactly to look
    with h5py.File(subhaloFile,'r') as groupFile:
        rowNum=groupFile["Offsets"]['Subhalo_SublinkRowNum'][subhaloOffset]
        lastProgId=groupFile["Offsets"]['Subhalo_SublinkLastProgenitorID'][subhaloOffset]
        subhaloId=groupFile["Offsets"]['Subhalo_SublinkSubhaloID'][subhaloOffset]
    treeFileOffsets=int(rowNum)-treeFileOffsets
    treeFileNum=np.max(np.where(treeFileOffsets >= 0))
    treeFile=basePath+'trees/SubLink/tree_extended.'+str(treeFileNum)+'.hdf5'
    rowStart = treeFileOffsets[treeFileNum]

    with h5py.File(treeFile,'r') as rawTree:
        # finding which entries in the tree we're interested in
        firstId = rawTree['RootDescendantID'][rowStart]
        rowStart=rowStart+(firstId-subhaloId)
        lastId = rawTree['LastProgenitorID'][rowStart]
        rowEnd=rowStart+lastId-firstId+1
        nFind=rawTree['SubfindID'][rowStart:rowEnd]
        nSnap=rawTree['SnapNum'][rowStart:rowEnd]
        nSub=rawTree['SubhaloID'][rowStart:rowEnd]
        nFirst=rawTree['FirstProgenitorID'][rowStart:rowEnd]
        nNext=rawTree['NextProgenitorID'][rowStart:rowEnd]
        nDesc=rawTree['DescendantID'][rowStart:rowEnd]

    # initialises the tree
    thisTree=-1*np.ones((136,2),dtype=int)
    thisTree[:,0]=np.arange(135,-1,-1)

    # traces the tree forward to the latest subhalo in the mpb
    zIndex=np.argwhere((nFind==subhalo_id) & (nSnap==snapshot_number))[0][0]
    thisIndex=zIndex
    if thisIndex!=0:
        descSub=nDesc[zIndex]
        thisSub=nSub[zIndex]
        descIndex=zIndex+descSub-nSub[zIndex]
        while ((nFirst[descIndex]==thisSub) & (nDesc[descIndex]!=-1)): # while the first progenitor of each descendant is this subhalo
            descSub=nDesc[descIndex]
            thisSub=nSub[descIndex]
            descIndex=descIndex+descSub-nSub[descIndex]
            thisIndex=descIndex
    thisSnap=nSnap[thisIndex]
    thisFind=nFind[thisIndex]
    thisTree[135-thisSnap,1]=thisFind # records subfind id of first step

    # initialises the list of merging galaxies
    if thisSnap!=135: # if it doesn't reach z=0, records the subhalo it merges with
        descIndex=thisIndex+nDesc[thisIndex]-nSub[thisIndex]
        descSnap=nSnap[descIndex]
        descFind=nFind[descIndex]
        mergerTree=[[descSnap,descFind]]
    else:
        mergerTree=[]

    while nFirst[thisIndex]!=-1: # goes through main progenitors
        thisIndex=thisIndex+nFirst[thisIndex]-nSub[thisIndex] # index of next step along the main progenitor branch
        thisSnap=nSnap[thisIndex] # snapshot of this main progenitor
        thisFind=nFind[thisIndex] # subfind id of this main progenitor
        thisTree[135-thisSnap,1]=thisFind # records subfind id of this main progenitor
        nextIndex=thisIndex+0 # separate index for walking the merging (next progenitor) branches
        while nNext[nextIndex]!=-1: # goes through merging halos (next progenitors)
            nextIndex=nextIndex+nNext[nextIndex]-nSub[nextIndex] # index of next progenitor
            # records details of these mergers
            mergerSnap=nSnap[nextIndex]
            mergerSub=nFind[nextIndex]
            mergerTree.append([mergerSnap,mergerSub])

    # got to the end of this branch, so must save it
    filled=np.argwhere(thisTree[:,1]!=-1) # finds the snapshots during which the halo is in the tree
    if filled.size==1:
        thisTree=thisTree[filled[0],:]
    elif thisTree[0,1]==-1: # if the tree doesn't make it to z=0
        thisTree=thisTree[filled[0][0]:filled[-1][0]+1,:]
    else:
        thisTree=thisTree[0:filled[-1][0]+1,:] # if it does
    mergerTree=np.array(mergerTree)
    data={"Main":thisTree}
    data['Mergers']=mergerTree
    return data
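For testing I've been calling it like this (with basePath pointing at a local copy of the data; the subhalo and snapshot numbers are just an example):

data = returnSimpleTree(0, 135)
print(data['Main'])    # rows of [snapshot, subfind id] along the main progenitor branch
print(data['Mergers']) # rows of [snapshot, subfind id] for each merging subhalo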
Dylan Nelson
25 Sep '17
Hi Zephyr,
Looks good, I have added this as an API endpoint, sublink/simple.json.
If you can write some brief information as to exactly what the return represents and how it was calculated, I will add it to the documentation.
Zephyr Penoyre
25 Sep '17
Hi Dylan, that's fantastic thank you!
For a description, how about:
"Retrieves the snapshot number and subfind ID of this subhalo's main progenitor branch, and of the subhalos merging with it, across all time. Note that this will only start at the snapshot at which the subhalo comes into existence, and may end before the final snapshot if this subhalo merges into a larger body, in which case the first entry in the mergers dictionary points to the subhalo into which it merges."