Working with Data¶
The starspace library defines a small set of objects for reading and manipulating data: Spots, Matrix, and Regions. Each object subclasses xarray.Dataset or xarray.DataArray, so it can be used the same way as any other xarray object. For those more familiar with numpy or pandas, there are simple ways to drop out of xarray, and for those who are more familiar with R and wish to use that language, we show how to serialize each object into a format that can be loaded into R.
For ease of use, starspace packages some pre-formatted data, which is stored in starspace.data. These data are used in the examples below.
Matrix¶
Serialization options¶
starspace defines two special serialization routines for the Matrix object to improve usability with downstream genomics packages.
import starspace
matrix = starspace.data.osmFISH.matrix()
# save to loom for reading in R
matrix.to_loom("osmFISH.loom")
# convert to anndata for use with scanpy
adata = matrix.to_anndata()
# optionally, save to disk
adata.write("osmFISH.h5ad")
Because Matrix subclasses xarray.DataArray, it can also take advantage of any of the xarray serialization routines, for example:
matrix.to_netcdf("osmFISH.nc")
Extracting column or row metadata¶
Turn row or column metadata into a tidy pandas.DataFrame:
import starspace
matrix = starspace.data.osmFISH.matrix()
# pandas dataframe
col_metadata = matrix.column_metadata()
# pandas dataframe
row_metadata = matrix.row_metadata()
To extract the cell x gene expression data into a numpy.ndarray:
import starspace
matrix = starspace.data.osmFISH.matrix()
# numpy array
data = matrix.values
For more information on how to work with xarray objects, see the xarray documentation.
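The sketch below illustrates the same two ideas (tidy metadata and a plain array) using only xarray, pandas, and numpy. It does not use starspace: the DataArray, its dimension names ("cells", "genes"), and the coordinate names ("x", "y") are hypothetical stand-ins chosen for illustration, and the real Matrix object may be laid out differently.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for a Matrix: 3 cells x 2 genes, with per-cell
# metadata stored as extra coordinates along the "cells" dimension.
matrix = xr.DataArray(
    np.array([[5, 0], [2, 7], [0, 1]]),
    dims=("cells", "genes"),
    coords={
        "cells": ["c0", "c1", "c2"],
        "genes": ["Gad2", "Slc32a1"],
        "x": ("cells", [10.0, 20.5, 31.0]),
        "y": ("cells", [4.0, 8.5, 15.0]),
    },
)

# Gather every non-index coordinate that lies along "cells" into a
# tidy DataFrame, one row per cell.
cell_metadata = pd.DataFrame(
    {
        name: coord.values
        for name, coord in matrix.coords.items()
        if coord.dims == ("cells",) and name != "cells"
    },
    index=matrix["cells"].values,
)

# Drop out of xarray entirely: a plain cells x genes numpy array.
counts = matrix.values

# Or keep the row and column labels by converting to a DataFrame.
expression = matrix.to_pandas()
```

Here `matrix.values` discards all labels, while `to_pandas()` keeps the cell and gene names as the index and columns, which is often more convenient for downstream inspection.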
Spots¶
Spots is a simple, tidy, columnar data object that records the position and identity of each spot. Because of this structure, it is simple to turn it into a pandas.DataFrame:
import starspace
spots = starspace.data.osmFISH.spots()
# pandas dataframe
df = spots.to_dataframe()
From pandas, one can serialize the pandas.DataFrame in a number of ways, including to .csv:
df.to_csv('osmFISH_spots.csv')
See the pandas documentation for more information.
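As a rough illustration of the round trip from a tidy xarray object to CSV and back, the sketch below builds a small xarray.Dataset by hand rather than loading starspace data. The dimension name "spot" and the variable names "x", "y", and "gene" are assumptions made for this example, not the documented Spots schema.

```python
import io

import pandas as pd
import xarray as xr

# Hypothetical stand-in for Spots: one record per spot, with a position
# and a gene identity. The variable names here are illustrative only.
spots = xr.Dataset(
    {
        "x": ("spot", [1.5, 2.0, 7.25]),
        "y": ("spot", [3.0, 4.5, 0.5]),
        "gene": ("spot", ["Gad2", "Gad2", "Aldoc"]),
    }
)

# One row per spot, one column per variable.
df = spots.to_dataframe()

# Write CSV to an in-memory buffer (a file path works the same way),
# then read it back to confirm nothing was lost.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Once in pandas, ordinary tabular operations apply, e.g. spots per gene.
spots_per_gene = df.groupby("gene").size()
```

Passing `index=False` keeps the positional spot index out of the CSV; omit it if the index carries meaningful identifiers.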
Regions¶
Regions is a Dask-serialized label image. We use dask so that large images, often bigger than would fit in memory, can be easily manipulated. Images that fit in memory can be easily converted into numpy arrays for downstream processing:
import starspace
regions = starspace.data.osmFISH.regions()
# numpy array
data = regions.values
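To show the difference between lazy, chunked computation and loading the whole image, here is a minimal sketch using a dask-backed xarray.DataArray. It assumes dask is installed, and the tiny label image and the dimension names "y" and "x" are hypothetical stand-ins rather than the real Regions data.

```python
import numpy as np
import xarray as xr

# Hypothetical label image: 0 is background, 1 and 2 are two regions.
labels = np.zeros((4, 6), dtype=np.int32)
labels[0:2, 0:3] = 1
labels[2:4, 3:6] = 2

# Chunking turns the in-memory array into a dask-backed one, so
# operations on it are evaluated lazily, chunk by chunk.
regions = xr.DataArray(labels, dims=("y", "x")).chunk({"y": 2, "x": 3})

# Build a lazy computation: the pixel area of region 1.
area_lazy = (regions == 1).sum()

# Nothing has been computed yet; compute() triggers the work.
area_region_1 = int(area_lazy.compute().item())

# For an image that fits in memory, .values loads it as a numpy array.
data = regions.values
```

For images larger than memory, reductions such as the area above can be computed without ever calling `.values` on the full array.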