Accessing data on the compute servers

Where are the data located?

Large data sets are located in the

/data/

directory. Typically, you should not depend on this directly, but use an environment variable DATADIR that is set to this directory and your code should use this environment variable to find the location of the top data directory. The same holds for the individual large data sets described below. All of these data might get moved around in the future, so by relying on environment variables your code will be resilient against such changes.

APOGEE data

The APOGEE data are located at $DATADIR/sdss/apogee following the layout described here. To use this data, you should set an environment variable SDSS_LOCAL_SAS_MIRROR to point to $DATADIR/sdss/apogee. This wil allow use of the apogee package, but will also be more generally useful.

We can download the APOGEE data with commands like

rsync -avz --include='*/' --include='*/*/' --include='apStar-*.fits' --include='asStar-*.fits' --exclude='*' rsync://SDSS_DATA_USERNAME@dtn01.sdss.org/sas/SDSS_DATA_RELEASE/apogee/spectro/redux/SDSS_DATA_REDUX/stars/ .

for apStar/asStar files in the stars/ directory and

rsync -avz --include='*/' --include='*/*/' --include='aspcapStar-*.fits' --include='allStar*.fits' --include='allVisit*.fits' --exclude='*' rsync://SDSS_DATA_USERNAME@dtn01.sdss.org/sas/SDSS_DATA_RELEASE/apogee/spectro/aspcap/SDSS_DATA_REDUX/SDSS_ASPCAP_REDUX/ .

for aspcapStar files and the allStar and allVisit files. In these commands, SDSS_DATA_USERNAME is the general SDSS username for accessing data and SDSS_DATA_RELEASE, SDSS_DATA_REDUX, and SDSS_ASPCAP_REDUX are the reduction identifiers (e.g., r12 and l33 for DR16). [Note that these commands have not been tested (much)].

Gaia data

The Gaia data are located at $DATADIR/Gaia following the layout of the Gaia data archive. To use this data as part of the gaia_tools package, you should set an environment variable GAIA_TOOLS_DATA to point to $DATADIR/data4code/gaia_tools.

The DR1 Gaia data were downloaded with the following commands (note that the DR1 data were later moved to a sub-directory gdr1):

curl -O http://cdn.gea.esac.esa.int/Gaia/tgas_source/fits/TgasSource_000-000-0[00-15].fits
curl -O http://cdn.gea.esac.esa.int/Gaia/gaia_source/fits/GaiaSource_000-0[00-19]-[000-255].fits
curl -O http://cdn.gea.esac.esa.int/Gaia/gaia_source/fits/GaiaSource_000-020-[000-110].fits

The DR2 Gaia data were downloaded using

wget -r -nH -nc --cut-dirs=2 --no-parent --reject="index.html*",vot.gz http://cdn.gea.esac.esa.int/Gaia/gdr2/

although rejecting the index.html files seems not to have worked! The DR2 data are also loaded into the catalogs database as described below. We also downloaded the Gaia DR2 2MASS match using queries like

SELECT tmass_match.*, tmass.designation, tmass.ra, tmass.dec, tmass.j_m, 
tmass.j_msigcom, tmass.h_m, tmass.h_msigcom, tmass.ks_m, tmass.ks_msigcom, 
tmass.ext_key, tmass.ph_qual
FROM gaiadr2.tmass_best_neighbour AS tmass_match
INNER JOIN gaiadr1.tmass_original_valid AS tmass ON tmass.tmass_oid = tmass_match.tmass_oid
WHERE tmass_match.source_id BETWEEN MIN_SOURCE_ID AND MAX_SOURCE_ID;

in chunks of 17 gaia_source CSV files (with source_id between
MIN_SOURCE_ID and MAX_SOURCE_ID).

2MASS, Gaia DR2, and other catalogs

2MASS data are located at $DATADIR/2mass and have been directly downloaded from the 2MASS archive. The data have been loaded into a PostgreSQL database catalogs that is running on the server (as is usual, this database is owned by the user postgres). The point-sources table is twomass_psc; the schema for this table is here. You can use it for example as

> psql catalogs -U postgres

catalogs=# select count(*) as rows FROM twomass_psc;

   rows    

-————

 470992970

(1 row)

In Python, you can access this database using, e.g., the psycopg2 module

>>> import psycopg2

>>> conn= psycopg2.connect(“dbname=catalogs user=postgres”)

>>> cur= conn.cursor()

>>> cur.execute(“select count(*) as rows FROM twomass_psc;”)

>>> print(cur.fetchall())

[(470992970L,)]

>>> cur.close()

>>> conn.close()

A slightly more complicated computes the number counts in 0.1 mag bins in J:

>>> cur.execute(“select floor(j_m*10), count(*) as count from twomass_psc group by floor(j_m*10);”)

>>> a= numpy.array(cur.fetchall(),dtype=‘float’).T

>>> sindx= numpy.argsort(a[0])

>>> semilogy(a[0,sindx]/10.,a[1,sindx])

Note that these queries each take about five minutes to execute. The 2MASS point-source catalog comes with a table twomass_psc_hp12 that has the HEALPix index at level 12 in nested format for (RA,Dec). The HEALPix index is contained in hp12index and can be joined with twomass_psc on pts_key. For example, to get all stars in pixel 100 at level 10, you can run in Python

cur.execute("select ra,decl from twomass_psc, twomass_psc_hp12 where twomass_psc.pts_key = twomass_psc_hp12.pts_key and floor(hp12index/16) = 100;")

making use of the fact that pixel numbers at lower levels are simple divisions by a factor of four for each lower level.

The Gaia DR2 data have similarly been loaded into the catalogs database as table gaiadr2_gaia_source. The load command was

> psql -f gaia_source_schema catalogs

> find gaia_source/csv/ -name GaiaSource_*.csv.gz | xargs zcat | sed -e ‘/^solution_id/d’ | psql -c “copy gaiadr2_gaia_source from stdin with delimiter ‘,’ NULL as ’’” catalogs

using the schema in the $DATADIR/Gaia/gdr2 directory. The database was then configured for rapid access to random subsets of the data by doing

> psql catalogs -U postgres

catalogs=# create index random_index_idx on gaiadr2_gaia_source (random_index);

catalogs=# set maintenance_work_mem= ‘32GB’;

catalogs=# cluster verbose gaiadr2_gaia_source using random_index_idx;

The gaia_source_with_rv data was loaded into a similar table gaiadr2_gaia_source_with_rv.

The Gaia DR2 2MASS match was downloaded using the script available here and then loaded with the schema available here using

> psql -f tmass_schema catalogs -U postgres

> find tmass_best_neighbour/csv/ -name tmass_best_neighbour_*.csv.gz | xargs zcat | sed -e ‘/^source_id/d’ | psql -c “SET TRANSACTION READ WRITE; copy gaiadr2_tmass_best_neighbour from stdin with delimiter ‘,’ NULL as ’’” catalogs -U postgres

Similarly, the Gaia DR2 PanSTARRS1 match was downloaded using the script available here and then loaded with the schema available here using

> psql -f panstarrs1_schema catalogs -U postgres

> find panstarrs1_best_neighbour/csv/ -name panstarrs1_best_neighbour_*.csv.gz | xargs zcat | sed -e ‘/^source_id/d’ | psql -c “SET TRANSACTION READ WRITE; copy gaiadr2_panstarrs1_best_neighbour from stdin with delimiter ‘,’ NULL as ’’” catalogs -U postgres

Finally, the catalogs database was made read-only by default using

> psql catalogs -U postgres

catalogs=# set default_transaction_read_only= true;

but note that this can be turned off by simply running the last command with false instead. The Gaia EDR3 data in the database were loaded with similar commands and schemas as for the DR2 data as described above, except that when loading this data, the read-only default gave issues and it was necessary to alter the postgres user’s default permissions as

>psql catalogs -U postgres

catalogs=# alter user postgres set default_transaction_read_only= false;

and then to undo this after. This makes sense, since the default_transaction_read_only parameter is a session parameter, but it doesn’t make sense why this wasn’t necessary before…

To (re-)start the database, log in as the postgres user

su postgres

and start the database as

cd ~postgres && postgres -D /usr/local/pgsql/data >logfile 2>&1 &

Downloading data

If you require a large data set, it should be downloaded and kept under the DATADIR directory. Only users that are part of the datagrp group are allowed to write to DATADIR. Users are added to this group using

usermod -g datagrp USER