User Documentation

NESE Disk

NESE Disk provides file services for systems located in the Massachusetts Green High Performance Computing Center (MGHPCC) as well as data lake services via Globus. NESE Disk is composed of a Ceph file system with gateways that support RADOS Block Devices and a CephFS service that supports the NESE Globus endpoint. Object storage is implemented by servers with standard configurations that change over time as storage technology evolves. As of July 2023, total raw capacity was approximately 41 petabytes, with the ability to expand according to need.

At purchase time, PIs and organizations specify what fraction of storage is to be made available via the Globus data lake service and what fraction is to be made available as network attached storage.

To use NESE Disk file services, you must have access to a system located in the MGHPCC. This could be a campus HPC system (such as Engaging or Unity), a departmental system, or a group server. This service does not provide compute, and there is no way to access the storage without another server being available in our data center. Contact your local campus support group to find out which systems are eligible for NESE Disk file services.

Once storage is purchased, your local campus research computing support group will mount the storage on the systems of your choosing. From here, you simply navigate to the mount point and use it just like you use $HOME or $SCRATCH directories.

Note

NESE Disk file services are not high-performance, parallel file systems. If you need parallel I/O or high IOPS, stage your data to a $SCRATCH filesystem before submitting a job.

For storage that is available via Globus, you will use Globus-enabled apps including the web applications, the CLI, or the SDK to interact with your data lake. You will not be provided interactive access and must use Globus-enabled clients. To access your data lake, you must know your Globus “Collection Name” that will be provided to you at allocation time.

Globus Data Lake services have use cases that complement those covered by standard NESE Disk file services. If you are only using servers or HPC systems in our data center, you probably want to simply use the file services discussed above.

Use cases:

Note

Storage that is available via Globus is ONLY available via Globus. No interactive access or shell access is available.

NESE Tape

NESE Tape provides higher density, lower cost storage, currently accessible via Globus. NESE Tape is composed of a tape system with several storage frames and 34 tape drives supporting up to 70 PB today with space available for expansion as needed.

Each NESE Tape allocation comes with a disk-based staging area that is available via Globus. Users write to the staging area and then the data is migrated to tape based on a storage lifecycle policy. The default quota on this area is 10 TB or 2% of tape capacity, whichever is larger. There is also a minimum temporary hard quota set to 4 x staging-area space to allow for short term movement of larger amounts of data.
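As a rough illustration of the quota rule above, the following shell sketch estimates the default staging quota and the minimum temporary hard quota for a given tape allocation. The function names are hypothetical, sizes are whole terabytes, and integer arithmetic rounds the 2% figure down, so treat the output as an estimate rather than the value NESE will actually set.

```shell
# Hypothetical helper: default staging quota in TB for a tape allocation in TB.
# Rule from the text: 10 TB or 2% of tape capacity, whichever is larger.
staging_quota_tb() {
    sq_two_percent=$(( $1 * 2 / 100 ))   # integer division, rounds down
    if [ "$sq_two_percent" -gt 10 ]; then
        echo "$sq_two_percent"
    else
        echo 10
    fi
}

# Hypothetical helper: minimum temporary hard quota, 4 x the staging quota.
min_hard_quota_tb() {
    echo $(( 4 * $(staging_quota_tb "$1") ))
}

staging_quota_tb 200     # 2% is 4 TB, so the 10 TB floor applies
min_hard_quota_tb 200
staging_quota_tb 1000    # 2% is 20 TB, which exceeds the floor
min_hard_quota_tb 1000
```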

There is also a quota on the number of files that can be stored per allocation. This quota is calculated by assuming an average file size of 100 MB across the tape pool capacity associated with the fileset. For example, a tape allocation of 200 TB will have a default file-count quota of 2,000,000 (200 TB / 100 MB).
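The same arithmetic can be reproduced for any allocation size. This is a minimal sketch with a hypothetical function name, assuming decimal units (1 TB = 1,000,000 MB), which is what the 200 TB example implies.

```shell
# Hypothetical helper: default file-count quota for a tape allocation in TB.
# Assumes decimal units (1 TB = 1,000,000 MB) and a 100 MB average file size.
file_count_quota() {
    echo $(( $1 * 1000000 / 100 ))
}

file_count_quota 200   # prints 2000000, matching the example in the text
```

If your data set contains many small files, bundling them into tarballs before transfer helps keep the file count under this quota.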

The storage lifecycle policy is:

  • Premigrate: all newly written files whose modification time is more than 2 hours old are copied to tape

  • Migrate 1: if fileset quota usage exceeds 99%, files are stubbed (replaced with a small pointer) until usage falls to 75%

  • Migrate 2: files that have not been accessed for more than 2 weeks are stubbed

  • Files smaller than 100 MB are copied to tape but also remain on disk

“Stubbed files” are files that have been migrated to tape and have only a reference left on disk. When a stubbed file is accessed via Globus, the file is retrieved by the tape robots, promoted to the disk-based staging area, and the stub is then replaced with the original file.
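One way to read the lifecycle policy is as a small decision procedure. The sketch below is an interpretation, not the actual policy engine: the function name and the state labels are hypothetical, it ignores the quota-driven "Migrate 1" rule (which depends on overall fileset usage), and the real policy runs periodically, so a file's actual state may lag behind what this predicts.

```shell
# Hypothetical classifier for where a file's data is expected to live.
# Arguments: size in MB, hours since last modification, days since last access.
lifecycle_state() {
    ls_size_mb=$1
    ls_mtime_hours=$2
    ls_atime_days=$3
    if [ "$ls_mtime_hours" -le 2 ]; then
        echo "disk only (not yet premigrated)"
    elif [ "$ls_size_mb" -lt 100 ]; then
        echo "disk and tape (small file, stays on disk)"
    elif [ "$ls_atime_days" -gt 14 ]; then
        echo "tape only (stubbed)"
    else
        echo "disk and tape (premigrated)"
    fi
}

lifecycle_state 5000 1 0     # large file written an hour ago
lifecycle_state 50 48 30     # small file, untouched for a month
lifecycle_state 5000 48 30   # large file, untouched for a month
lifecycle_state 5000 48 3    # large file, accessed three days ago
```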

For more information on using Globus, see the Globus section below.

Warning

DO NOT delete files through the Globus interface. This will simply delete the files on the disk-based staging area but will not remove files from tape. Deleting files through Globus will make the tape storage inaccessible.

Encryption

Transfers via Globus to NESE Tape are not encrypted in transit or at rest. If your data management plan requires encryption, you must encrypt your files before sending them to NESE Tape via Globus.

One easy way to do that is to use GPG with tar to create password-protected, encrypted tarballs. First prepare your directory to be archived and make sure you have the gpg command line tool installed.

Then, create your tarball and pipe the output to the gpg command. Here we use symmetric encryption, so all that is required to decrypt the tarball is the password. You will be prompted to enter your password twice to encrypt the tarball.

$ tar czvpf - /path/to/dir/archiveme | gpg --symmetric --cipher-algo aes256 -o myarchive.tar.gz.gpg

Now, you can copy myarchive.tar.gz.gpg to NESE Tape via Globus.

To extract the files, run the command below and enter your password when prompted:

$ gpg -d myarchive.tar.gz.gpg | tar xzvf -
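Before removing the original files, it can be worth confirming that the encrypted tarball decrypts and lists cleanly. The sketch below wraps that check in a function with a hypothetical name; it uses tar's t (list) mode instead of x (extract), so nothing is written to disk.

```shell
# Hypothetical helper: decrypt an encrypted tarball and list its contents
# without extracting anything. gpg prompts for the password as usual.
list_encrypted_archive() {
    gpg -d "$1" | tar tzf -
}

# Example (run interactively):
#   list_encrypted_archive myarchive.tar.gz.gpg
```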

This is a simple example of how to encrypt data before sending it to NESE. If you require additional security or want to exchange encrypted data with other users without providing them passwords, public-key based, asymmetric cryptography can be used with gpg. If a public key is used to encrypt data (either locally or via a PGP keyserver), you must have both the private key and the passphrase to decrypt the data!

If you’d like help with using asymmetric encryption with your NESE workflows, email help@nese.mghpcc.org for assistance.

Warning

DO NOT lose your password to your encrypted tarballs. There is no way to recover the data without the password.

Globus

Whether you are using NESE Tape or the NESE Disk data lake services, you will use Globus. Globus can be used in two main ways: through a web interface or through command line tools.

Web Portal

The primary means to interact with Globus is through the web portal. Below, three different scenarios will be described for ways to access data to be sent to NESE services. No matter which method you are using for sending the data, you will log into the Globus.org web portal, click “Log in” in the upper right hand corner, and begin transferring data to NESE. You will also use the same portal for restoring data from NESE Tape or sending data from NESE Disk data lake services to another destination.

When working with the Web Portal, there are three different types of transfers.

  • Globus Connect Server to Globus Connect Server

  • Globus Connect Personal to Globus Connect Server

  • Local computer via web app to Globus Connect Server

Globus Connect Server (GCS) will be set up and maintained by systems administrators. In all three cases above, one of the GCSs will be the NESE Tape or Disk endpoints. When you are provided access to NESE services, you will be given a ‘Collection Name’. This will be the collection you write to for the data lake or archival services.

The primary way to transfer data will be from one Globus Connect Server to the NESE-hosted GCS. This method can be used to transfer data from campus or national systems to NESE or vice versa. In addition to your NESE collection name, you will need information on the Globus configuration for the other target.

Globus information:

The second way to transfer files is from Globus Connect Personal to a Globus Connect Server hosted endpoint. Globus Connect Personal turns your laptop or other personal computer into a Globus endpoint with just a few clicks. With Globus Connect Personal you can share and transfer files to/from a local machine—campus server, desktop computer or laptop—even if it’s behind a firewall and you don’t have administrator privileges.

Globus Connect Personal uses the same authentication and provides access to your collections just as two Globus Connect Server endpoints would; however, it automatically suspends transfers when the computer sleeps and resumes them when the computer is turned back on.

Globus Connect Personal can be installed for Mac OS X, for Linux including Debian and RedHat based distros and openSUSE, and Windows.

Note

You may only have a single install of Globus Connect Personal. Choose your system wisely.

Once you’ve installed Globus Connect Personal, you’ll be able to create a new collection for your laptop / desktop in Globus and create a bookmark. Now, you are able to use the Globus Web Portal to transfer files from this new collection (your laptop) to NESE tape and back. While you are still using the web app to initiate the transfer, the actual data is not sent using the web app. Your new personal endpoint connects to the NESE endpoint and transfers happen directly with support for suspend, resume, and the changing of networks for your laptop.

The final way to transfer data is from any other laptop or desktop that is not running Globus Connect Personal (GCP) to NESE-hosted Globus endpoints. While you can only have one GCP instance, you can still use the web portal to move data to and from NESE. This method is best used to download and upload modest amounts of data directly, as it lacks many of the features of GCS or GCP and your web browser window must stay open for the duration of the transfer.

Regardless of the method you are using, go to Globus.org and click “Log in” in the upper right hand corner. Once logged in, search for your NESE allocation via the Collection Search dialog box. The collection name should have been provided to you at the time of NESE allocation.

Once you’ve located your share, click on it to load it into the File Manager app. Click “Bookmark” in the upper right hand side of the window and give it a Name such as “NESE Tape” and then click “Create Bookmark”.

From here, you can upload or download data directly from your computer or set up a transfer from one collection to another.

Warning

File transfers to NESE-hosted endpoints are NOT encrypted. If you require encryption, you should encrypt the files yourself before sending to NESE.

Command Line Tools

In addition to the web portal, Globus provides a command line interface (CLI) that wraps its Python SDK.

Install the Globus CLI using pipx:

$ python3 -m pip install --user pipx
$ python3 -m pipx ensurepath
$ pipx install globus-cli

Alternatively, the Globus CLI can be installed using (mini)conda.

$ conda create -c conda-forge -n gcli globus-cli
$ conda activate gcli

Once the CLI is installed, you need to authenticate with Globus.

$ globus login

By default, this will open a web browser to globus.org and ask you to authenticate. If you are on a remote HPC system, such as Engaging, this can be done in an Open OnDemand remote desktop.

Alternatively, you can specify an additional flag to generate a login URL.

$ globus login --no-local-server

This will generate an OAuth2 globus.org authentication URL. Copy this URL into a web browser on your local laptop or desktop, authenticate as before, and the browser will provide an authorization code. This code is valid for 10 minutes and must be copied and pasted back into the terminal that ran the globus login --no-local-server command.

Once completed, verify authentication.

$ globus whoami

From here, you can follow the Globus CLI QuickStart Guide.
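As a starting point, a transfer to a NESE collection from the command line might look like the sketch below. The function name, label, and all UUIDs and paths are placeholders, not values from NESE; substitute the collection UUIDs reported by globus endpoint search for your own collections.

```shell
# Sketch: copy a directory to a NESE collection and wait for the task.
# Arguments: source collection UUID, source path,
#            NESE collection UUID, destination path.
nese_transfer() {
    nt_src="$1:$2"
    nt_dst="$3:$4"
    # Submit a recursive transfer and capture only the task ID.
    nt_task_id=$(globus transfer "$nt_src" "$nt_dst" \
        --recursive --label "NESE upload" \
        --jmespath 'task_id' --format unix) || return 1
    echo "Submitted task $nt_task_id"
    # Block until the task completes.
    globus task wait "$nt_task_id"
}

# Example (placeholder UUIDs and paths):
#   globus endpoint search "Your NESE Collection Name"
#   globus ls NESE_COLLECTION_UUID:/
#   nese_transfer SOURCE_COLLECTION_UUID /path/to/data/ NESE_COLLECTION_UUID /data/
```

As noted in the encryption section, files sent this way are not encrypted by NESE, so encrypt sensitive data before submitting the transfer.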

In addition to the Globus CLI, there is a very powerful package, Archivetar, that is designed to be used with large volumes of data, Globus, and hierarchical storage systems (such as the one used by NESE tape).

Archivetar:

Globus References