Starspace

Starspace defines a minimal, proof-of-concept schema for gene or protein expression data that contain spatially localized information. This project:

  1. Defines a standard schema to describe spots, spatial matrices, and cell regions.
  2. Implements a library that reads and writes files in the defined schema, leveraging the zarr container to ensure scalability.
  3. Converts data from a variety of published assay types, including Spatial Transcriptomics, CODEX, In-situ Sequencing, MERFISH, osmFISH, and STARmap, to demonstrate the flexibility of the schema.
  4. Demonstrates how to visualize and interact with these data using common analysis packages, and how to convert them into loom and anndata objects for downstream analysis in R and Python.


Schema

Starspace defines three basic object types that describe the features extracted from a typical spatial experiment: spots, regions (e.g. cells), and region x feature (e.g. cell x gene) expression matrices. It defines standard terminology to describe the basic spatial information for each object, which allows interoperability between data from different assays and, if such a standard were adopted, would make these data accessible to common tool chains.

This repository intentionally describes a schema rather than a format; these data could be stored in any number of ways. This repository chooses to use zarr.

Spots

Spots is a table in which each record describes a spot. Its columns consist of a small set of required fields and several standardized but optional fields:

[
  {
    "description": "Name of the gene, using standard symbols",
    "mode": "REQUIRED",
    "name": "gene_name",
    "type": "STRING"
  },
  {
    "description": "y-coordinate of the center of the spot in microns",
    "mode": "REQUIRED",
    "name": "y_spot_microns",
    "type": "FLOAT"
  },
  {
    "description": "x-coordinate of the center of the spot in microns",
    "mode": "REQUIRED",
    "name": "x_spot_microns",
    "type": "FLOAT"
  },
  {
    "description": "z-coordinate of the center of the spot in microns",
    "mode": "OPTIONAL",
    "name": "z_spot_microns",
    "type": "FLOAT"
  },
  {
    "description": "y-coordinate of the center of the spot in pixels",
    "mode": "REQUIRED",
    "name": "y_spot_pixels",
    "type": "FLOAT"
  },
  {
    "description": "x-coordinate of the center of the spot in pixels",
    "mode": "REQUIRED",
    "name": "x_spot_pixels",
    "type": "FLOAT"
  },
  {
    "description": "z-coordinate of the center of the spot in pixels",
    "mode": "OPTIONAL",
    "name": "z_spot_pixels",
    "type": "FLOAT"
  },
  {
    "description": "id of the region (e.g. cell) that this spot belongs to",
    "mode": "OPTIONAL",
    "name": "region_id",
    "type": "INT"
  },
  {
    "description": "y-coordinate of the region this spot falls inside in microns",
    "mode": "OPTIONAL",
    "name": "y_region_microns",
    "type": "FLOAT"
  },
  {
    "description": "x-coordinate of the region this spot falls inside in microns",
    "mode": "OPTIONAL",
    "name": "x_region_microns",
    "type": "FLOAT"
  },
  {
    "description": "z-coordinate of the region this spot falls inside in microns",
    "mode": "OPTIONAL",
    "name": "z_region_microns",
    "type": "FLOAT"
  },
  {
    "description": "y-coordinate of the region this spot falls inside in pixels",
    "mode": "OPTIONAL",
    "name": "y_region_pixels",
    "type": "FLOAT"
  },
  {
    "description": "x-coordinate of the region this spot falls inside in pixels",
    "mode": "OPTIONAL",
    "name": "x_region_pixels",
    "type": "FLOAT"
  },
  {
    "description": "z-coordinate of the region this spot falls inside in pixels",
    "mode": "OPTIONAL",
    "name": "z_region_pixels",
    "type": "FLOAT"
  },
  {
    "description": "quality of this spot",
    "mode": "OPTIONAL",
    "name": "quality",
    "type": "FLOAT"
  },
  {
    "description": "radius of the spot",
    "mode": "OPTIONAL",
    "name": "radius",
    "type": "FLOAT"
  },
  {
    "description": "field of view that this spot is associated with",
    "mode": "OPTIONAL",
    "name": "fov",
    "type": "INT"
  }
]

The axes of this object can optionally be named:

[
  {
    "description": "name of first spots axis, which contains an integer index",
    "mode": "OPTIONAL",
    "name": "spot_index",
    "type": "STRING"
  },
  {
    "description": "name of second spots axis, listing spot characteristics",
    "mode": "OPTIONAL",
    "name": "spot_characteristics",
    "type": "STRING"
  }
]
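
As a rough illustration of how a field list like the one above might be used, the sketch below checks individual spot records against an abbreviated copy of the schema. The `validate_record` helper and the `PYTHON_TYPES` mapping are hypothetical names introduced for this example; they are not part of the starspace library.

```python
import json

# An abbreviated copy of the spots schema above (three of its entries).
SCHEMA_JSON = """
[
  {"name": "gene_name", "mode": "REQUIRED", "type": "STRING"},
  {"name": "x_spot_microns", "mode": "REQUIRED", "type": "FLOAT"},
  {"name": "z_spot_microns", "mode": "OPTIONAL", "type": "FLOAT"}
]
"""

# Hypothetical mapping from schema type names to Python types.
PYTHON_TYPES = {"STRING": str, "FLOAT": (int, float), "INT": int}


def validate_record(record, schema):
    """Return a list of human-readable problems found in one spot record."""
    problems = []
    for field in schema:
        name = field["name"]
        if name not in record:
            # Only required fields are reported when absent.
            if field["mode"] == "REQUIRED":
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(record[name], PYTHON_TYPES[field["type"]]):
            problems.append(f"wrong type for {name}: expected {field['type']}")
    return problems


schema = json.loads(SCHEMA_JSON)
good = {"gene_name": "ACTB", "x_spot_microns": 12.5}
bad = {"gene_name": "ACTB", "z_spot_microns": "deep"}

print(validate_record(good, schema))
print(validate_record(bad, schema))
```

Driving the check from the JSON itself, rather than from a hard-coded column list, keeps the validation in step with the schema if fields are added later.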

Regions

Regions stores a label image. All pixels belonging to the same cell are encoded with the same integer value, and each successive object is labeled with the next integer. Intersecting such an image with the spots yields a count matrix, and the image can also be overlaid on image data to verify that cells were properly segmented.

The axes of this object can optionally be named:

[
  {
    "description": "name of regions y-axis",
    "mode": "OPTIONAL",
    "name": "y_region",
    "type": "STRING"
  },
  {
    "description": "name of regions x-axis",
    "mode": "OPTIONAL",
    "name": "x_region",
    "type": "STRING"
  },
  {
    "description": "name of regions z-axis",
    "mode": "OPTIONAL",
    "name": "z_region",
    "type": "STRING"
  },
  {
    "description": "regions pixel size y",
    "mode": "OPTIONAL",
    "name": "region_pixel_size_y",
    "type": "FLOAT"
  },
  {
    "description": "regions pixel size x",
    "mode": "OPTIONAL",
    "name": "region_pixel_size_x",
    "type": "FLOAT"
  },
  {
    "description": "regions pixel size z",
    "mode": "OPTIONAL",
    "name": "region_pixel_size_z",
    "type": "FLOAT"
  }
]
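
To make the relationship between a label image and spots concrete, here is a minimal sketch that looks up the label under each spot and tallies spots per region and gene. It uses nested lists instead of a dask or numpy array to stay dependency-free, and it assumes that 0 marks background pixels; that convention is an assumption of this example, not something the schema above specifies.

```python
from collections import Counter

# A tiny 4 x 5 label image as nested lists (rows are y, columns are x).
# Assumption for this sketch: 0 marks background and each cell has its own
# positive integer label.
label_image = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 2, 2],
    [3, 0, 0, 2, 2],
]

# Spots as (gene_name, y_pixel, x_pixel); the values are illustrative.
spots = [
    ("geneA", 0, 1),
    ("geneA", 1, 2),
    ("geneB", 1, 4),
    ("geneB", 3, 0),
    ("geneA", 2, 1),  # falls on background
]


def count_spots_per_region(label_image, spots):
    """Count spots per (region_id, gene_name), skipping background pixels."""
    counts = Counter()
    for gene, y, x in spots:
        region_id = label_image[y][x]
        if region_id != 0:
            counts[(region_id, gene)] += 1
    return counts


counts = count_spots_per_region(label_image, spots)
print(sorted(counts.items()))
```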

Matrix

The matrix file is a traditional region x feature expression matrix. Its values can contain either count data (e.g. spots) or continuous data (e.g. protein intensities). Regions can represent cells, anatomical areas, or stereotyped super-cellular areas, like those measured by Slide-seq or Spatial Transcriptomics. Features can be protein or RNA abundances, or counts of other anatomical structures aggregated over regions.

The matrix stores metadata for each region that describe characteristics of the region:

[
  {
    "description": "unique identifier for the region",
    "mode": "REQUIRED",
    "name": "region_id",
    "type": "INT"
  },
  {
    "description": "y-coordinate of the center of the region in microns",
    "mode": "REQUIRED",
    "name": "y_region_microns",
    "type": "FLOAT"
  },
  {
    "description": "x-coordinate of the center of the region, in microns",
    "mode": "REQUIRED",
    "name": "x_region_microns",
    "type": "FLOAT"
  },
  {
    "description": "z-coordinate of the center of the region in microns",
    "mode": "OPTIONAL",
    "name": "z_region_microns",
    "type": "FLOAT"
  },
  {
    "description": "y-coordinate of the center of the region in pixels",
    "mode": "OPTIONAL",
    "name": "y_region_pixels",
    "type": "FLOAT"
  },
  {
    "description": "x-coordinate of the center of the region, in pixels",
    "mode": "OPTIONAL",
    "name": "x_region_pixels",
    "type": "FLOAT"
  },
  {
    "description": "z-coordinate of the center of the region in pixels",
    "mode": "OPTIONAL",
    "name": "z_region_pixels",
    "type": "FLOAT"
  },
  {
    "description": "physical annotation for the region, e.g. 'brain white matter'",
    "mode": "OPTIONAL",
    "name": "physical_annotation",
    "type": "STRING"
  },
  {
    "description": "cell type annotation for the region",
    "mode": "OPTIONAL",
    "name": "type_annotation",
    "type": "STRING"
  },
  {
    "description": "group id for this cell, e.g. cluster id",
    "mode": "OPTIONAL",
    "name": "group_id",
    "type": "INT"
  },
  {
    "description": "field of view that this region was identified in",
    "mode": "OPTIONAL",
    "name": "fov",
    "type": "INT"
  },
  {
    "description": "area of the region in pixels",
    "mode": "OPTIONAL",
    "name": "area_pixels",
    "type": "FLOAT"
  },
  {
    "description": "area of the region in square micrometers",
    "mode": "OPTIONAL",
    "name": "area_sq_microns",
    "type": "FLOAT"
  }
]

The matrix also stores metadata that describe the features:

[
  {
    "description": "name of the feature (e.g. gene or protein name)",
    "mode": "REQUIRED",
    "name": "gene_name",
    "type": "STRING"
  }
]

The axes of the matrix can optionally be named:

[
  {
    "description": "name of first matrix axis, describing regions",
    "mode": "OPTIONAL",
    "name": "regions",
    "type": "STRING"
  },
  {
    "description": "name of second matrix axis, describing features",
    "mode": "OPTIONAL",
    "name": "features",
    "type": "STRING"
  }
]
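
The region x feature layout can be sketched without any array library: the example below pivots per-region, per-gene counts into a dense matrix and keeps `region_id` and `gene_name` as the metadata for the two axes. The `pivot_counts` helper and the input values are hypothetical and serve only to illustrate the shape of the object.

```python
def pivot_counts(counts):
    """Pivot {(region_id, gene_name): count} into a dense region x feature matrix."""
    region_ids = sorted({region for region, _ in counts})
    gene_names = sorted({gene for _, gene in counts})
    row_of = {region: i for i, region in enumerate(region_ids)}
    col_of = {gene: j for j, gene in enumerate(gene_names)}
    # Start from zeros so region/gene pairs without spots stay at zero.
    values = [[0] * len(gene_names) for _ in region_ids]
    for (region, gene), n in counts.items():
        values[row_of[region]][col_of[gene]] = n
    return region_ids, gene_names, values


counts = {(1, "geneA"): 2, (2, "geneB"): 1, (3, "geneB"): 1, (2, "geneA"): 4}
region_ids, gene_names, values = pivot_counts(counts)

print(region_ids)  # metadata for the first (regions) axis
print(gene_names)  # metadata for the second (features) axis
for row in values:
    print(row)
```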

Working with Data

The starspace library defines a very simple set of objects to read and manipulate the Spots, Matrix, and Regions objects. Each object subclasses an xarray.Dataset or xarray.DataArray object, meaning that they can be used the same way one would use an xarray object. For those more familiar with numpy or pandas, there are simple ways to drop out of xarray, and for those that are more familiar with R and wish to use that language, we show how to serialize each object into a format that can be loaded into R.

For ease of use, starspace packages some pre-formatted data, which are stored in starspace.data. These data are used in the examples below.

Matrix

Serialization options

starspace defines two special serialization routines for the Matrix object to improve usability with downstream genomics packages.

import starspace
matrix = starspace.data.osmFISH.matrix()

# save to loom for reading in R
matrix.to_loom("osmFISH.loom")

# convert to anndata for use with scanpy
adata = matrix.to_anndata()
# optionally, save to disk (AnnData objects are written with .write)
adata.write("osmFISH.h5ad")

Because Matrix subclasses xarray.DataArray, it can also take advantage of any of the xarray serialization routines, for example:

matrix.to_netcdf("osmFISH.nc")

Extracting column or row metadata

Turn row or column metadata into a tidy pandas.DataFrame:

import starspace
matrix = starspace.data.osmFISH.matrix()

# pandas dataframe
col_metadata = matrix.column_metadata()
# pandas dataframe
row_metadata = matrix.row_metadata()

To extract cell x gene expression data into a numpy.array:

import starspace
matrix = starspace.data.osmFISH.matrix()

# numpy array
data = matrix.values

For more information on how to work with xarray objects, see the xarray documentation.

Spots

Spots is a simple tidy columnar data file that records the positions and identity of each spot. Because of this structure, it is simple to turn it into a pandas.DataFrame:

import starspace
spots = starspace.data.osmFISH.spots()

# pandas dataframe
df = spots.to_dataframe()

From pandas, one can serialize the pandas.DataFrame in a number of ways, including to .csv:

df.to_csv('osmFISH_spots.csv')

See the pandas documentation for more information.

Regions

Regions is a Dask-serialized label image. We use dask so that large images, often bigger than would fit in memory, can be easily manipulated. Images that do fit in memory can be converted into numpy arrays for downstream processing:

import starspace
regions = starspace.data.osmFISH.regions()

# numpy array
data = regions.values

Conversion Scripts

The following directory contains examples to convert author-published results into the spatial schema. The majority of the scripts are very simple. Each script is named as follows and has at minimum the following contents:

<assay_name>_<first_author>_<year>_<journal>_<short_description>.py

  1. Link to original manuscript or preprint, if available, else data attribution information
  2. Checklist of available data, including:
     1. cell (or region) x gene count matrix
     2. transcript locations (if appropriate)
     3. cell locations in polygons or masks
  3. Instructions to load and convert data into required format, including any information acquired via direct communications with authors.
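
The naming convention above can be sketched as a small helper. Both function names are hypothetical. The rule of lower-casing each part and hyphenating within it is an assumption inferred from directory names such as `iss_ke_2013_nat-methods_breast-cancer` that appear in the scripts in this document.

```python
import re


def slug(text):
    """Lower-case text and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", str(text).lower()).strip("-")


def conversion_script_name(assay, first_author, year, journal, short_description):
    """Join the five name parts with underscores and add the .py suffix."""
    parts = (assay, first_author, year, journal, short_description)
    return "_".join(slug(part) for part in parts) + ".py"


print(conversion_script_name("ISS", "Ke", 2013, "Nat Methods", "breast cancer"))
# prints: iss_ke_2013_nat-methods_breast-cancer.py
```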

In situ sequencing for RNA analysis in preserved tissue and cells

Rongqin Ke, Marco Mignardi, Alexandra Pacureanu, Jessica Svedlund, Johan Botling, Carolina Wählby, Mats Nilsson

This publication can be found at https://www.nature.com/articles/nmeth.2563; the script below downloads the data it converts.

Checklist:

  - [x] point locations
  - [ ] cell locations
  - [ ] cell x gene expression matrix (derivable)

Load the data

import requests
from io import BytesIO

import pandas as pd

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/iss_ke_2013_nat-methods_breast-cancer/all_spots.csv"
)

data = pd.read_csv(BytesIO(response.content))

column_map = {
    "gene": SPOTS_REQUIRED_VARIABLES.GENE_NAME.value,
    "x": SPOTS_REQUIRED_VARIABLES.X_SPOT.value,
    "y": SPOTS_REQUIRED_VARIABLES.Y_SPOT.value,
    "qual": SPOTS_OPTIONAL_VARIABLES.QUALITY.value,
    "fov": SPOTS_OPTIONAL_VARIABLES.FIELD_OF_VIEW.value,
    "gene_code": "gene_code",
    "barcode": "barcode",
}

authors = [
    "Rongqin Ke", "Marco Mignardi", "Alexandra Pacureanu", "Jessica Svedlund", "Johan Botling",
    "Carolina Wählby", "Mats Nilsson"
]
attributes = {
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.ISS,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "Her2+ breast carcinoma",
    REQUIRED_ATTRIBUTES.AUTHORS: authors,
    REQUIRED_ATTRIBUTES.YEAR: 2013,
    REQUIRED_ATTRIBUTES.ORGANISM: "human",
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME: (
        "In situ sequencing for RNA analysis in preserved tissue and cells"
    ),
    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://www.nature.com/articles/nmeth.2563",
}

standard_columns = [column_map[c] for c in data.columns]
data.columns = standard_columns

spots = starspace.Spots.from_spot_data(data, attributes)

# s3_url = "s3://starfish.data.output-warehouse/iss_ke_2013_nat-methods_breast-cancer"
local_url = "iss_ke_2013_nat-methods_breast-cancer/"
spots.save_zarr(local_url)
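
The column-renaming step in the script above raises a bare `KeyError` if the published file contains a column that is missing from `column_map`. A slightly more defensive variant is sketched below in plain Python. The `rename_columns` helper is hypothetical, and the target names are assumed to match the schema field names listed earlier rather than being read from `starspace.constants`.

```python
# Assumed target names, copied from the schema field names; the real values
# live in starspace.constants and may differ.
COLUMN_MAP = {
    "gene": "gene_name",
    "x": "x_spot_microns",
    "y": "y_spot_microns",
    "qual": "quality",
}


def rename_columns(columns, column_map):
    """Translate author column names to schema names, failing loudly on gaps."""
    unmapped = [c for c in columns if c not in column_map]
    if unmapped:
        raise KeyError(f"no schema mapping for columns: {unmapped}")
    return [column_map[c] for c in columns]


print(rename_columns(["gene", "x", "y", "qual"], COLUMN_MAP))
```

Listing every unmapped column in one message makes it easier to extend the map when a new author-supplied file is converted.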

High-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization

Jeffrey R. Moffitt, Junjie Hao, Guiping Wang, Kok Hao Chen, Hazen P. Babcock, Xiaowei Zhuang

This publication can be found at https://www.pnas.org/content/113/39/11046 and the data referenced below can be downloaded from s3://starfish.data.published/MERFISH/20181005/starfish_results/published_MERFISH_decoded_results.csv

Checklist:

  - [x] point locations
  - [ ] cell locations
  - [ ] cell x gene expression matrix (derivable)

This file converts point locations constructed with a starfish pipeline that has 99.7% correspondence to Jeffrey Moffitt's original MATLAB processing of these same data. Minor deviations are the result of numerical differences in deconvolution algorithms between MATLAB and Python.

Load the data

from io import BytesIO

import numpy as np
import pandas as pd
import requests

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/merfish_moffit_2016_pnas_u2-os/"
    "published_MERFISH_decoded_results.csv"
)

data = pd.read_csv(BytesIO(response.content), index_col=0)

# convert distance to quality; we'll map the name to quality below
data['distance'] = 1 - data['distance']

# drop the passes_thresholds column, this data has been conditioned on that previously
assert np.all(data['passes_thresholds'])
data = data.drop('passes_thresholds', axis=1)

# drop z_spot, it's not informative
assert np.allclose(data['zc'], 0.0005)
data = data.drop('zc', axis=1)

column_map = {
    'radius': SPOTS_OPTIONAL_VARIABLES.RADIUS.value,
    'target': SPOTS_REQUIRED_VARIABLES.GENE_NAME.value,
    'distance': SPOTS_OPTIONAL_VARIABLES.QUALITY.value,
    'xc': SPOTS_REQUIRED_VARIABLES.X_SPOT.value,
    'yc': SPOTS_REQUIRED_VARIABLES.Y_SPOT.value
}

columns = [column_map[c] for c in data.columns]
data.columns = columns

attributes = {
    REQUIRED_ATTRIBUTES.ORGANISM: "human",
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.MERFISH.value,
    REQUIRED_ATTRIBUTES.YEAR: 2016,
    REQUIRED_ATTRIBUTES.AUTHORS: [
        "Jeffrey R. Moffitt", "Junjie Hao", "Guiping Wang", "Kok Hao Chen", "Hazen P. Babcock",
        "Xiaowei Zhuang"
    ],
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "osteosarcoma (bone, epithelial) cell line",
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME: (
        "High-throughput single-cell gene-expression profiling with multiplexed error-robust "
        "fluorescence in situ hybridization"
    ),
    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://www.pnas.org/content/113/39/11046"
}

spots = starspace.Spots.from_spot_data(data, attributes)
s3_url = "s3://starfish.data.output-warehouse/merfish-moffit-2016-pnas-u2os/"
url = "merfish-moffit-2016-pnas-u2os/"
spots.save_zarr(url)

Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging

Yury Goltsev, Nikolay Samusik, Julia Kennedy-Darling, Salil Bhate, Matthew Hale, Gustavo Vazquez, Sarah Black, Garry P. Nolan

The data can be downloaded here: http://welikesharingdata.blob.core.windows.net/forshare/index.html and the paper is available here: https://doi.org/10.1016/j.cell.2018.07.010

from collections import defaultdict
from io import BytesIO

import pandas as pd
import requests

import starspace
from starspace.constants import *

response = requests.get(
   "https://d24h2xsgaj29mf.cloudfront.net/raw/codex_goltsev_2018_cell_spleen/"
   "Suppl.Table2.CODEX_paper_MRLdatasetexpression.csv"
)
data = pd.read_csv(BytesIO(response.content))

attributes = {
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.CODEX,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "spleen",
    REQUIRED_ATTRIBUTES.AUTHORS: [
        "Yury Goltsev", "Nikolay Samusik", "Julia Kennedy-Darling", "Salil Bhate", "Matthew Hale",
        "Gustavo Vazquez", "Sarah Black", "Garry P. Nolan"
    ],
    REQUIRED_ATTRIBUTES.YEAR: 2018,
    REQUIRED_ATTRIBUTES.ORGANISM: "mouse",
    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://doi.org/10.1016/j.cell.2018.07.010",
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME:
        "Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging",
}

dims = tuple(MATRIX_AXES)

x = data["X.X"]
y = data["Y.Y"]
z = data["Z.Z"]
group = data["niche cluster ID"]
metadata_col = data["sample_Xtile_Ytile"]
type_annotation = data["Imaging phenotype cluster ID"]

data = data.drop(
    ["X.X", "Y.Y", "Z.Z", "sample_Xtile_Ytile", "niche cluster ID", "Imaging phenotype cluster ID"],
    axis=1
)

additional_metadata = defaultdict(list)
for v in metadata_col:
    sample_type, fov_x, fov_y = v.split('_')
    additional_metadata["sample_type"].append(sample_type)
    additional_metadata["fov_x"].append(int(fov_x.strip("X")))
    additional_metadata["fov_y"].append(int(fov_y.strip("Y")))
additional_metadata = pd.DataFrame(additional_metadata)

coords = {
    MATRIX_REQUIRED_REGIONS.REGION_ID: (MATRIX_AXES.REGIONS, data.index),
    MATRIX_REQUIRED_FEATURES.GENE_NAME: (MATRIX_AXES.FEATURES, data.columns),
    MATRIX_REQUIRED_REGIONS.X_REGION: (MATRIX_AXES.REGIONS, x),
    MATRIX_REQUIRED_REGIONS.Y_REGION: (MATRIX_AXES.REGIONS, y),
    MATRIX_OPTIONAL_REGIONS.Z_REGION: (MATRIX_AXES.REGIONS, z),
    MATRIX_OPTIONAL_REGIONS.GROUP_ID: (MATRIX_AXES.REGIONS, group),
    MATRIX_OPTIONAL_REGIONS.TYPE_ANNOTATION: (MATRIX_AXES.REGIONS, type_annotation),
    "fov_x": (MATRIX_AXES.REGIONS, additional_metadata["fov_x"]),
    "fov_y": (MATRIX_AXES.REGIONS, additional_metadata["fov_y"]),
    "sample_type": (MATRIX_AXES.REGIONS, additional_metadata["sample_type"])
}

matrix = starspace.Matrix.from_expression_data(data.values, coords, dims, attributes)
url = "codex_goltsev_2018_cell_spleen/"
matrix.save_zarr(url=url)
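
The loop over `sample_Xtile_Ytile` in the script above splits each label into a sample type and two tile indices. The same parsing is sketched here as a standalone function. `parse_tile_label` is a hypothetical helper, and the label `"sampleA_X04_Y10"` is a made-up value that assumes the three-part `sample_X.._Y..` layout implied by the script.

```python
def parse_tile_label(label):
    """Split 'sample_Xnn_Ynn' into (sample_type, fov_x, fov_y)."""
    sample_type, fov_x, fov_y = label.split("_")
    # Strip the axis letter, then parse the zero-padded tile index.
    return sample_type, int(fov_x.strip("X")), int(fov_y.strip("Y"))


print(parse_tile_label("sampleA_X04_Y10"))
# prints: ('sampleA', 4, 10)
```

Because the label is unpacked into exactly three names, a sample name that itself contains an underscore would raise a `ValueError`, so this layout is worth checking before converting a new file.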

Visualization and analysis of gene expression in tissue sections by spatial transcriptomics

Patrik L. Ståhl, Fredrik Salmén, Sanja Vickovic, Anna Lundmark, José Fernández Navarro, Jens Magnusson, Stefania Giacomello, Michaela Asp, Jakub O. Westholm, Mikael Huss, Annelie Mollbrink, Sten Linnarsson, Simone Codeluppi, Åke Borg, Fredrik Pontén, Paul Igor Costea, Pelin Sahlén, Jan Mulder, Olaf Bergmann, Joakim Lundeberg, Jonas Frisén

This publication can be found at https://science.sciencemag.org/content/353/6294/78.long and the data referenced below can be downloaded from https://www.spatialresearch.org/resources-published-datasets/doi-10-1126science-aaf2403/

Checklist:

  - [x] point locations
  - [x] cell locations (NA)
  - [x] cell x gene expression matrix (NA)

Load the data

from io import BytesIO

import dask.array as da
import numpy as np
import pandas as pd
import requests
from skimage.transform import matrix_transform

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/spatial_transcriptomics_stahl_2016/"
    "Rep1_MOB_count_matrix-1.tsv"
)
data = pd.read_csv(BytesIO(response.content), sep='\t', index_col=0)

attributes = {
    REQUIRED_ATTRIBUTES.AUTHORS: (
        "Patrik L. Ståhl", "Fredrik Salmén", "Sanja Vickovic", "Anna Lundmark",
        "José Fernández Navarro", "Jens Magnusson", "Stefania Giacomello", "Michaela Asp",
        "Jakub O. Westholm", "Mikael Huss", "Annelie Mollbrink", "Sten Linnarsson",
        "Simone Codeluppi", "Åke Borg", "Fredrik Pontén", "Paul Igor Costea", "Pelin Sahlén",
        "Jan Mulder", "Olaf Bergmann", "Joakim Lundeberg", "Jonas Frisén"
    ),
    REQUIRED_ATTRIBUTES.YEAR: 2016,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "Olfactory Bulb",
    REQUIRED_ATTRIBUTES.ORGANISM: "mouse",
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.SPATIAL_TRANSCRIPTOMICS.value,
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME: (
        "Visualization and analysis of gene expression in tissue sections by spatial "
        "transcriptomics"
    ),
    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://science.sciencemag.org/content/353/6294/78.long"
}
# convert the spots data
# cells maybe need a radius?

# transform coordinates
response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/spatial_transcriptomics_stahl_2016/"
    "Rep1_MOB_transformation.txt"
)
transform = np.array([float(v) for v in response.content.decode().strip().split()]).reshape(3, 3).T

x, y = zip(*[map(float, v.split('x')) for v in data.index])

xy = np.hstack([
    np.array(x)[:, None],
    np.array(y)[:, None],
])

transformed = matrix_transform(xy, transform)

dims = (MATRIX_AXES.REGIONS.value, MATRIX_AXES.FEATURES.value)
coords = {
    MATRIX_REQUIRED_REGIONS.REGION_ID: (MATRIX_AXES.REGIONS, np.arange(data.shape[0])),
    MATRIX_REQUIRED_REGIONS.X_REGION: (MATRIX_AXES.REGIONS, transformed[:, 0]),
    MATRIX_REQUIRED_REGIONS.Y_REGION: (MATRIX_AXES.REGIONS, transformed[:, 1]),
    MATRIX_REQUIRED_FEATURES.GENE_NAME: (MATRIX_AXES.FEATURES, data.columns)
}
data = da.from_array(data.values, chunks=MATRIX_CHUNK_SIZE)

matrix = starspace.Matrix.from_expression_data(
    data=data, coords=coords, dims=dims, name="matrix", attrs=attributes
)
url = "spatial-transcriptomics-stahl-2016-science-olfactory-bulb"
matrix.save_zarr(url=url)
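
The coordinate handling in the script above has three steps: parse the `XxY` row labels, read a 3 x 3 matrix from nine numbers and transpose it, and apply the matrix to the points in homogeneous coordinates. The dependency-free sketch below mirrors those steps as I understand them. The helper names and the example numbers are hypothetical, and the final function only approximates what `skimage.transform.matrix_transform` does for 2D points.

```python
def parse_spot_label(label):
    """Turn an 'XxY' row label such as '4.5x6' into an (x, y) pair of floats."""
    x, y = (float(v) for v in label.split("x"))
    return x, y


def parse_transform(text):
    """Read nine whitespace-separated numbers and transpose them into a 3 x 3 matrix."""
    v = [float(t) for t in text.split()]
    rows = [v[0:3], v[3:6], v[6:9]]
    return [list(col) for col in zip(*rows)]


def apply_transform(points, matrix):
    """Apply a 3 x 3 homogeneous transform to a list of (x, y) points."""
    out = []
    for x, y in points:
        tx = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
        ty = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
        w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
        out.append((tx / w, ty / w))
    return out


# Made-up transform: scale x by 2 and y by 3, then shift by (10, -5).
matrix = parse_transform("2 0 0 0 3 0 10 -5 1")
points = [parse_spot_label(label) for label in ["1x2", "4.5x6"]]
print(apply_transform(points, matrix))
# prints: [(12.0, 1.0), (19.0, 13.0)]
```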

Visualization and analysis of gene expression in tissue sections by spatial transcriptomics

Patrik L. Ståhl, Fredrik Salmén, Sanja Vickovic, Anna Lundmark, José Fernández Navarro, Jens Magnusson, Stefania Giacomello, Michaela Asp, Jakub O. Westholm, Mikael Huss, Annelie Mollbrink, Sten Linnarsson, Simone Codeluppi, Åke Borg, Fredrik Pontén, Paul Igor Costea, Pelin Sahlén, Jan Mulder, Olaf Bergmann, Joakim Lundeberg, Jonas Frisén

This publication can be found at https://science.sciencemag.org/content/353/6294/78.long and the data referenced below can be downloaded from https://www.spatialresearch.org/resources-published-datasets/doi-10-1126science-aaf2403/

Checklist:

  - [x] point locations
  - [x] cell locations (NA)
  - [x] cell x gene expression matrix (NA)

Load the data

from io import BytesIO

import dask.array as da
import numpy as np
import pandas as pd
import requests
from skimage.transform import matrix_transform

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/spatial_transcriptomics_stahl_2016/"
    "Layer1_BC_count_matrix-1.tsv"
)
data = pd.read_csv(BytesIO(response.content), sep='\t', index_col=0)

attributes = {
    REQUIRED_ATTRIBUTES.AUTHORS: (
        "Patrik L. Ståhl", "Fredrik Salmén", "Sanja Vickovic", "Anna Lundmark",
        "José Fernández Navarro", "Jens Magnusson", "Stefania Giacomello", "Michaela Asp",
        "Jakub O. Westholm", "Mikael Huss", "Annelie Mollbrink", "Sten Linnarsson",
        "Simone Codeluppi", "Åke Borg", "Fredrik Pontén", "Paul Igor Costea", "Pelin Sahlén",
        "Jan Mulder", "Olaf Bergmann", "Joakim Lundeberg", "Jonas Frisén"
    ),
    REQUIRED_ATTRIBUTES.YEAR: 2016,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "prostate cancer",
    REQUIRED_ATTRIBUTES.ORGANISM: "human",
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.SPATIAL_TRANSCRIPTOMICS.value,
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME: (
        "Visualization and analysis of gene expression in tissue sections by spatial "
        "transcriptomics"
    ),
    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://science.sciencemag.org/content/353/6294/78.long"
}
# convert the spots data
# cells maybe need a radius?

# transform coordinates
response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/spatial_transcriptomics_stahl_2016/"
    "Layer1_BC_transformation.txt"
)
transform = np.array([float(v) for v in response.content.decode().strip().split()]).reshape(3, 3).T

x, y = zip(*[map(float, v.split('x')) for v in data.index])

xy = np.hstack([
    np.array(x)[:, None],
    np.array(y)[:, None],
])

transformed = matrix_transform(xy, transform)

dims = (MATRIX_AXES.REGIONS.value, MATRIX_AXES.FEATURES.value)
coords = {
    MATRIX_REQUIRED_REGIONS.REGION_ID: (MATRIX_AXES.REGIONS, np.arange(data.shape[0])),
    MATRIX_REQUIRED_REGIONS.X_REGION: (MATRIX_AXES.REGIONS, transformed[:, 0]),
    MATRIX_REQUIRED_REGIONS.Y_REGION: (MATRIX_AXES.REGIONS, transformed[:, 1]),
    MATRIX_REQUIRED_FEATURES.GENE_NAME: (MATRIX_AXES.FEATURES, data.columns)
}
data = da.from_array(data.values, chunks=MATRIX_CHUNK_SIZE)

matrix = starspace.Matrix.from_expression_data(
    data=data, coords=coords, dims=dims, name="matrix", attrs=attributes
)
url = "spatial-transcriptomics-stahl-2016-science-prostate-cancer"
matrix.save_zarr(url=url)

Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region

Jeffrey R. Moffitt, Dhananjay Bambah-Mukku, Stephen W. Eichhorn, Eric Vaughn, Karthik Shekhar, Julio D. Perez, Nimrod D. Rubinstein, Junjie Hao, Aviv Regev, Catherine Dulac, Xiaowei Zhuang

This publication can be found at https://science.sciencemag.org/content/362/6416/eaau5324 and the data referenced below can be downloaded from https://datadryad.org/handle/10255/dryad.192644

Checklist:

  - [ ] point locations
  - [ ] cell locations
  - [x] cell x gene expression matrix

Load the data

import os
import requests
from io import BytesIO

import dask.array as da
import numpy as np
import pandas as pd

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/merfish_moffit_2018_science_hypothalamic-preoptic/"
    "Moffitt_and_Bambah-Mukku_et_al_merfish_all_cells.csv"
)
data = pd.read_csv(BytesIO(response.content), header=0)

name = "merfish moffit 2018 science hypothalamic preoptic"

This data file is a cell x gene expression matrix that contains additional metadata as columns of the matrix. Extract those extra columns and clean up the data file.

annotation = np.array(data["Cell_class"], dtype="U")
group_id = np.array(data["Neuron_cluster_ID"], dtype="U")
x = data["Centroid_X"]
y = data["Centroid_Y"]
region_id = np.array(data["Cell_ID"], dtype="U")

unstructured_field_names = ["Animal_ID", "Animal_sex", "Behavior", "Bregma"]
unstructured_metadata = data[unstructured_field_names]
non_expression_fields = (
        unstructured_field_names
        + ["Cell_class", "Neuron_cluster_ID", "Centroid_X", "Centroid_Y", "Cell_ID"]
)
expression_data = data.drop(non_expression_fields, axis=1)
gene_name = [v.lower() for v in expression_data.columns]

Write down some important metadata from the publication.

attrs = {
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.MERFISH,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "hypothalamic pre-optic nucleus",
    REQUIRED_ATTRIBUTES.AUTHORS: [
        "Jeffrey R. Moffitt", "Dhananjay Bambah-Mukku", "Stephen W. Eichhorn", "Eric Vaughn",
        "Karthik Shekhar", "Julio D. Perez", "Nimrod D. Rubinstein", "Junjie Hao", "Aviv Regev",
        "Catherine Dulac", "Xiaowei Zhuang"
    ],
    REQUIRED_ATTRIBUTES.YEAR: 2018,
    REQUIRED_ATTRIBUTES.ORGANISM: "mouse",
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME: (
        "Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic "
        "region"
    ),
    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://science.sciencemag.org/content/362/6416/eaau5324",
}

Create the chunked dataset.

chunk_data = da.from_array(expression_data.values, chunks=MATRIX_CHUNK_SIZE)

Wrap the dask array in an xarray, adding the metadata fields as “coordinates”.

# convert columns with object dtype into fixed-length strings

coords = {
    MATRIX_REQUIRED_FEATURES.GENE_NAME: (MATRIX_AXES.FEATURES.value, gene_name),
    MATRIX_REQUIRED_REGIONS.X_REGION: (MATRIX_AXES.REGIONS.value, x),
    MATRIX_REQUIRED_REGIONS.Y_REGION: (MATRIX_AXES.REGIONS.value, y),
    MATRIX_REQUIRED_REGIONS.REGION_ID: (MATRIX_AXES.REGIONS.value, region_id),
    MATRIX_OPTIONAL_REGIONS.GROUP_ID: (MATRIX_AXES.REGIONS.value, group_id),
    MATRIX_OPTIONAL_REGIONS.TYPE_ANNOTATION: (MATRIX_AXES.REGIONS.value, annotation)
}
dims = (MATRIX_AXES.REGIONS.value, MATRIX_AXES.FEATURES.value)
matrix = starspace.Matrix.from_expression_data(
    data=chunk_data, coords=coords, dims=dims, name=name, attrs=attrs
)

s3_url = "s3://starfish.data.output-warehouse/merfish-moffit-2018-science-hypothalamic-preoptic"
url = "merfish-moffit-2018-science-hypothalamic-preoptic"
matrix.save_zarr(url=url)

Spatially resolved, highly multiplexed RNA profiling in single cells

Kok Hao Chen, Alistair N. Boettiger, Jeffrey R. Moffitt, Siyuan Wang, Xiaowei Zhuang

This publication can be found at https://science.sciencemag.org/content/348/6233/aaa6090; the script below downloads the data it converts.

Checklist:

  - [x] point locations
  - [ ] cell locations
  - [x] cell x gene expression matrix (derivable)

Load the data

import requests
from io import BytesIO

import pandas as pd

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/merfish_chen_2015_science_imr90/"
    "140genesData.xlsx"
)
data = pd.read_excel(BytesIO(response.content))

name = "merfish chen 2015 science imr90"

This data file is a table of spots, one row per decoded RNA, that contains additional metadata as columns. Map those columns onto the schema and build region identifiers that are unique across the experiment.

# map column names to schema

column_map = {
    "RNACentroidX": SPOTS_REQUIRED_VARIABLES.X_SPOT,
    "RNACentroidY": SPOTS_REQUIRED_VARIABLES.Y_SPOT,
    "cellID": "per_slice_cell_id",  # this is not unique experiment-wide
    "CellPositionX": SPOTS_OPTIONAL_VARIABLES.X_REGION,
    "CellPositionY": SPOTS_OPTIONAL_VARIABLES.Y_REGION,
    "geneName": SPOTS_REQUIRED_VARIABLES.GENE_NAME,
    "experiment": "experiment",
    "library": "library",
    "intCodeword": "int_codeword",
    "isCorrectedMatch": "is_corrected_match",
    "isExactMatch": "is_exact_match"
}
columns = [column_map[c] for c in data.columns]
data.columns = columns

# demonstrate that cellID is not unique:
# (pass a list to groupby; recent pandas versions treat a tuple as a single key)
group_columns = [
    "per_slice_cell_id",
    SPOTS_OPTIONAL_VARIABLES.Y_REGION,
    SPOTS_OPTIONAL_VARIABLES.X_REGION,
]

# group by the columns, use size to run a no-op aggregation routine, then drop the size column
# (labeled zero)
not_unique = data.groupby(group_columns).size().reset_index().drop(0, axis=1)

assert_cols = ["per_slice_cell_id"]
assert not_unique[assert_cols].drop_duplicates().shape != not_unique[assert_cols].shape

# fix region ids so that they uniquely identify cells across the experiment.
# (pass a list to groupby; recent pandas versions treat a tuple as a single key)
group_columns = [
    "experiment", "library", "per_slice_cell_id",
    SPOTS_OPTIONAL_VARIABLES.Y_REGION, SPOTS_OPTIONAL_VARIABLES.X_REGION
]
region_ids_map = data.groupby(group_columns).size().reset_index().drop(0, axis=1)

assert_cols = ["per_slice_cell_id", "library", "experiment"]
assert region_ids_map[assert_cols].drop_duplicates().shape == region_ids_map[assert_cols].shape

# map each region to a unique identifier and add it to the data frame
region_ids_map = region_ids_map.drop(
    [SPOTS_OPTIONAL_VARIABLES.Y_REGION, SPOTS_OPTIONAL_VARIABLES.X_REGION], axis=1
)
region_ids_map = region_ids_map.reset_index().set_index(assert_cols)

region_ids = region_ids_map.loc[pd.MultiIndex.from_frame(data[assert_cols])]
# region_ids_map has a single column, "index", which holds the new identifier
data[SPOTS_OPTIONAL_VARIABLES.REGION_ID] = region_ids["index"].values
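
The identifier fix can be illustrated on a toy table. The following is a minimal sketch in plain pandas, not part of the starspace API: the table values are invented, and `groupby(...).ngroup()` is used as a compact alternative to the lookup-table approach in the script.

```python
import pandas as pd

# Toy spot table: cell IDs restart in every library (invented values).
toy = pd.DataFrame({
    "experiment": [1, 1, 1, 1, 2],
    "library": [1, 1, 2, 2, 1],
    "per_slice_cell_id": [1, 2, 1, 1, 1],
    "gene_name": ["A", "B", "A", "C", "A"],
})

# per_slice_cell_id alone is ambiguous: the same ID appears in several libraries.
key = ["experiment", "library", "per_slice_cell_id"]
n_cells = len(toy[key].drop_duplicates())
assert toy["per_slice_cell_id"].nunique() < n_cells

# ngroup() numbers each distinct key combination 0..n-1, which gives one
# experiment-wide identifier per cell.
toy["region_id"] = toy.groupby(key).ngroup()
```

Every spot that shares the same experiment, library, and per-slice cell ID receives the same `region_id`, and no two distinct cells share one.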

Write down some important metadata from the publication.

attrs = {
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.MERFISH,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "IMR90 lung fibroblast cell line",
    REQUIRED_ATTRIBUTES.AUTHORS: (
        "Kok Hao Chen", "Alistair N. Boettiger", "Jeffrey R. Moffitt", "Siyuan Wang",
        "Xiaowei Zhuang"
    ),
    REQUIRED_ATTRIBUTES.YEAR: 2015,
    REQUIRED_ATTRIBUTES.ORGANISM: "human",
    OPTIONAL_ATTRIBUTES.NOTES: (
        "cellID field from author data renamed per_slice_cell_id to reflect stored data"
    )
}

Convert the dataframe into an xarray dataset.

spots = starspace.Spots.from_spot_data(data, attrs)

Write the data to zarr. This example saves to a local path; s3_url records the matching S3 location.

s3_url = "s3://starfish.data.output-warehouse/merfish-chen-2015-science-imr90/"
url = "merfish-chen-2015-science-imr90/"
spots.save_zarr(url)

Convert the xarray dataset to a matrix.

matrix = spots.to_spatial_matrix()
matrix.save_zarr(url)
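
How `to_spatial_matrix` works internally is not shown in this document. As a rough illustration of the idea only, a spot table that carries a region identifier can be pivoted into a region x gene count matrix with plain pandas. The column names and values below are invented for the sketch.

```python
import pandas as pd

# Toy spot table with a region assigned to every spot (invented values).
spots_df = pd.DataFrame({
    "region_id": [0, 0, 0, 1, 1, 2],
    "gene_name": ["ACTB", "ACTB", "GAPDH", "ACTB", "MYC", "GAPDH"],
})

# Count spots per (region, gene); combinations with no spots become 0.
counts = pd.crosstab(spots_df["region_id"], spots_df["gene_name"])
```

Each row of `counts` is a region and each column a gene, so the total of all entries equals the number of spots in the table.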


Modeling Spatial Correlation of Transcripts with Application to Developing Pancreas

Ruishan Liu, Marco Mignardi, Robert Jones, Martin Enge, Seung K. Kim, Stephen R. Quake & James Zou

This publication can be found at https://www.nature.com/articles/s41598-019-41951-2 and the data can be downloaded from https://cirm.ucsc.edu/projects

Checklist:

  - [x] point locations
  - [~] cell locations (centroids only)
  - [x] cell x gene expression matrix (derivable)

import requests
from pathlib import Path
from io import BytesIO

import pandas as pd

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/iss_liu_2019_nat-sci-reports_pancreas-dev/"
    "Nuc_TOT_2p2.txt"
)
region_data = pd.read_csv(BytesIO(response.content))

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/iss_liu_2019_nat-sci-reports_pancreas-dev/"
    "RNA_TOT_2p2.txt"
)
rna_data = pd.read_csv(BytesIO(response.content))

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/iss_liu_2019_nat-sci-reports_pancreas-dev/"
    "Conversion_Pool2.txt"
)
gene_map = pd.read_csv(BytesIO(response.content))

Build the spot table

# some of these spots don't map to real genes. Interesting. Definitely retain "Barcode_Num" and
# "Barcode_Letter"
gene_map = gene_map.set_index("Barcode_Num")
gene_info = gene_map.loc[rna_data.Seq_num, :]
gene_info.index = rna_data.index

rna_data = pd.concat([rna_data, gene_info], axis=1)

# "ObjectNumber" is the join key for gene ids, but we've joined all the tables, so we can drop it.
rna_data = rna_data.drop("ObjectNumber", axis=1)

# merge in cell centroids
region_data = region_data.set_index("ObjectNumber")
region_data = region_data.drop("ImageNumber", axis=1)  # duplicated in rna_data
region_info = region_data.loc[rna_data["Parent_Cells"], :]
region_info.index = rna_data.index

rna_data = pd.concat([rna_data, region_info], axis=1)

notes = list()
notes.append("'seq_num' contains channel information for the in-situ sequencing code of each gene")
notes.append("'barcode_letter' contains the nucleotides read out using ISS")


column_map = {
    "ImageNumber": SPOTS_OPTIONAL_VARIABLES.FIELD_OF_VIEW,
    "Blob_X": SPOTS_REQUIRED_VARIABLES.X_SPOT,
    "Blob_Y": SPOTS_REQUIRED_VARIABLES.Y_SPOT,
    "Parent_Cells": SPOTS_OPTIONAL_VARIABLES.REGION_ID,
    "Location_Center_X": SPOTS_OPTIONAL_VARIABLES.X_REGION,
    "Location_Center_Y": SPOTS_OPTIONAL_VARIABLES.Y_REGION,
    "Gene_Name": SPOTS_REQUIRED_VARIABLES.GENE_NAME,
    "Seq_qual": SPOTS_OPTIONAL_VARIABLES.QUALITY,
    "Seq_num": "seq_num",
    "Barcode_Letter": "barcode_letter",
}

columns = [column_map[c] for c in rna_data.columns]
rna_data.columns = columns

attributes = {
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.ISS.value,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "fetal pancreas",
    REQUIRED_ATTRIBUTES.AUTHORS: [
        "Ruishan Liu", "Marco Mignardi", "Robert Jones", "Martin Enge", "Seung K. Kim",
        "Stephen R. Quake", "James Zou"
    ],
    REQUIRED_ATTRIBUTES.YEAR: 2019,
    REQUIRED_ATTRIBUTES.ORGANISM: "human",
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME: (
        "Modeling Spatial Correlation of Transcripts with Application to Developing Pancreas"
    ),
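    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://www.nature.com/articles/s41598-019-41951-2",
    # store the notes collected above, which were otherwise never written out
    OPTIONAL_ATTRIBUTES.NOTES: "; ".join(notes),
}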

spots = starspace.Spots.from_spot_data(rna_data, attributes)


# s3_url = "s3://starfish.data.output-warehouse/iss_liu_2019_nat-sci-reports_pancreas-dev/"
url = "iss_liu_2019_nat-sci-reports_pancreas-dev/"
spots.save_zarr(url=url)
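
The barcode and centroid lookups in this script join tables with `.loc`, which fails if a key is missing from the lookup table. A hedged alternative, sketched here on invented data, is a left merge that keeps unmapped spots and makes them easy to find. The column names follow the script; the gene names and values are placeholders.

```python
import pandas as pd

# Toy versions of the spot table and the barcode-to-gene table (invented values).
rna = pd.DataFrame({"Seq_num": [11, 12, 11, 13], "Blob_X": [1.0, 2.0, 3.0, 4.0]})
genes = pd.DataFrame({"Barcode_Num": [11, 12], "Gene_Name": ["INS", "GCG"]})

# A left merge keeps every spot; validate guards against duplicated barcodes.
joined = rna.merge(
    genes,
    how="left",
    left_on="Seq_num",
    right_on="Barcode_Num",
    validate="many_to_one",
)

# Spots whose barcode has no entry in the gene table end up with a missing name.
unmapped = joined["Gene_Name"].isna()
```

Here the spot with barcode 13 has no gene, so it is flagged in `unmapped` instead of raising an error.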

We have the needed information to pivot into a matrix, too.

matrix = spots.to_spatial_matrix()
matrix.save_zarr(url=url)


Spatial organization of the somatosensory cortex revealed by cyclic smFISH

Simone Codeluppi, Lars E. Borm, Amit Zeisel, Gioele La Manno, Josina A. van Lunteren, Camilla I. Svensson, Sten Linnarsson

This publication can be found at https://www.nature.com/articles/s41592-018-0175-z and the data referenced below can be downloaded from http://linnarssonlab.org/osmfish/

Checklist:

  - [x] point locations
  - [x] cell locations
  - [x] cell x gene expression matrix

Load the data

import os
import pickle
import re
import requests
from itertools import repeat
from io import BytesIO
import tempfile

import dask.array as da
import h5py
import loompy
import numpy as np
import pandas as pd

import starspace
from starspace.constants import *

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/"
    "osmfish_codeluppi_2018_nat-methods_somatosensory-cortex/"
    "mRNA_coords_raw_counting.hdf5"
)
spots_data = h5py.File(BytesIO(response.content), "r")

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/"
    "osmfish_codeluppi_2018_nat-methods_somatosensory-cortex/"
    "polyT_seg.pkl"
)
region_data = pickle.load(BytesIO(response.content))



# load spot info
gene = []
x = []
y = []
imaging_round = []
pattern = r"^(.*?)_Hybridization(\d*?)$"
for k in spots_data.keys():
    gene_, round_ = re.match(pattern, k).groups()
    gene_data = spots_data[k]
    x.extend(gene_data[:, 0])
    y.extend(gene_data[:, 1])
    gene.extend(repeat(gene_, gene_data.shape[0]))
    imaging_round.extend(repeat(int(round_), gene_data.shape[0]))

spots_data.close()

# build the spot information
spot_data = pd.DataFrame({
    SPOTS_REQUIRED_VARIABLES.GENE_NAME: gene,
    SPOTS_REQUIRED_VARIABLES.X_SPOT: x,
    SPOTS_REQUIRED_VARIABLES.Y_SPOT: y,
    SPOTS_OPTIONAL_VARIABLES.ROUND: imaging_round,
})

# construct the attributes
attributes = {
    REQUIRED_ATTRIBUTES.AUTHORS: (
        "Simone Codeluppi", "Lars E. Borm", "Amit Zeisel", "Gioele La Manno",
        "Josina A. van Lunteren", "Camilla I. Svensson", "Sten Linnarsson"
    ),
    REQUIRED_ATTRIBUTES.YEAR: 2018,
    REQUIRED_ATTRIBUTES.SAMPLE_TYPE: "somatosensory cortex",
    REQUIRED_ATTRIBUTES.ORGANISM: "mouse",
    REQUIRED_ATTRIBUTES.ASSAY: ASSAYS.OSMFISH.value,
    OPTIONAL_ATTRIBUTES.PUBLICATION_NAME: (
        "Spatial organization of the somatosensory cortex revealed by cyclic smFISH"
    ),
    OPTIONAL_ATTRIBUTES.PUBLICATION_URL: "https://www.nature.com/articles/s41592-018-0175-z"
}

spots = starspace.Spots.from_spot_data(spot_data, attrs=attributes)
s3_url = (
    "s3://starfish.data.output-warehouse/osmfish-codeluppi-2018-nat-methods-somatosensory-cortex/"
)
url = "osmfish-codeluppi-2018-nat-methods-somatosensory-cortex/"
spots.save_zarr(url=url)
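
The key parsing in the loop above relies on each HDF5 dataset being named gene, then `_Hybridization`, then a round number. The sketch below applies the same pattern to a made-up key and adds an explicit error for keys that do not match; `parse_key` is a hypothetical helper, not part of the script or of starspace.

```python
import re

pattern = r"^(.*?)_Hybridization(\d*?)$"


def parse_key(key):
    """Split a dataset key into (gene, imaging round)."""
    match = re.match(pattern, key)
    if match is None:
        raise ValueError(f"unexpected dataset key: {key!r}")
    gene, round_ = match.groups()
    return gene, int(round_)


# "Gad2_Hybridization12" is a made-up key in the expected format.
print(parse_key("Gad2_Hybridization12"))  # → ('Gad2', 12)
```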

Load the region information. To keep lookups simple, build a label image; it is only about 6 GB, and we can put it in dask. First, find the extent of the image from the spot data.

x_min, x_max = np.percentile(spot_data[SPOTS_REQUIRED_VARIABLES.X_SPOT], [0, 100])
y_min, y_max = np.percentile(spot_data[SPOTS_REQUIRED_VARIABLES.Y_SPOT], [0, 100])

# start from zeros so that pixels outside every region have a defined background value
label = np.zeros((int(x_max) + 1, int(y_max) + 1), dtype=np.int16)

for region_id, array in region_data.items():
    region_id = int(region_id)
    x = array[:, 0]
    y = array[:, 1]
    label[x, y] = region_id

dims = tuple(REGIONS_AXES)

regions = starspace.Regions.from_label_image(label, dims=dims, attrs=attributes)
regions.save_zarr(url=url)
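
To show why a label image makes lookups simple, here is a small sketch with two invented regions. It assumes, as in the script, that each region is stored as an array of integer (x, y) pixel coordinates, and it treats 0 as the background value, which is an assumption of this sketch.

```python
import numpy as np

# Hypothetical regions: region ID -> (n_pixels, 2) array of integer (x, y) coordinates.
toy_regions = {
    "1": np.array([[0, 0], [0, 1], [1, 0]]),
    "2": np.array([[2, 2], [2, 3]]),
}

# Start from zeros so that unassigned pixels are a well-defined background.
label = np.zeros((3, 4), dtype=np.int16)
for region_id, pixels in toy_regions.items():
    label[pixels[:, 0], pixels[:, 1]] = int(region_id)

# Finding the region that contains a spot at (x, y) = (2, 3) is one array lookup.
region_of_spot = int(label[2, 3])
```

Assigning a spot to a region becomes a single indexing operation, at the cost of storing one value per pixel.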

Load the count matrix.

response = requests.get(
    "https://d24h2xsgaj29mf.cloudfront.net/raw/"
    "osmfish_codeluppi_2018_nat-methods_somatosensory-cortex/"
    "osmFISH_SScortex_mouse_all_cells.loom"
)
with tempfile.TemporaryDirectory() as tmpdirname:
    with open(os.path.join(tmpdirname, "temp.loom"), 'wb') as f:
        f.write(response.content)
    conn = loompy.connect(os.path.join(tmpdirname, "temp.loom"), mode="r")

    row_attrs = dict(conn.row_attrs)
    col_attrs = dict(conn.col_attrs)

    data = da.from_array(conn[:, :].T, chunks=MATRIX_CHUNK_SIZE)

# region id should be int dtype
col_attrs["CellID"] = col_attrs["CellID"].astype(int)

dims = (MATRIX_AXES.REGIONS.value, MATRIX_AXES.FEATURES.value)

coords = {
    MATRIX_REQUIRED_REGIONS.REGION_ID: (MATRIX_AXES.REGIONS.value, col_attrs["CellID"]),
    MATRIX_REQUIRED_REGIONS.X_REGION: (MATRIX_AXES.REGIONS.value, col_attrs["X"]),
    MATRIX_REQUIRED_REGIONS.Y_REGION: (MATRIX_AXES.REGIONS.value, col_attrs["Y"]),
    MATRIX_OPTIONAL_REGIONS.GROUP_ID: (MATRIX_AXES.REGIONS.value, col_attrs["ClusterID"]),
    MATRIX_OPTIONAL_REGIONS.TYPE_ANNOTATION: (MATRIX_AXES.REGIONS.value, col_attrs["ClusterName"]),
    MATRIX_OPTIONAL_REGIONS.PHYS_ANNOTATION: (MATRIX_AXES.REGIONS.value, col_attrs["Region"]),
    MATRIX_OPTIONAL_REGIONS.AREA_PIXELS: (MATRIX_AXES.REGIONS.value, col_attrs["size_pix"]),
    MATRIX_OPTIONAL_REGIONS.AREA_UM2: (MATRIX_AXES.REGIONS.value, col_attrs["size_um2"]),
    "valid": (MATRIX_AXES.REGIONS.value, col_attrs["Valid"]),
    "tsne_1": (MATRIX_AXES.REGIONS.value, col_attrs["_tSNE_1"]),
    "tsne_2": (MATRIX_AXES.REGIONS.value, col_attrs["_tSNE_2"]),
    MATRIX_OPTIONAL_FEATURES.CHANNEL: (MATRIX_AXES.FEATURES.value, row_attrs["Fluorophore"]),
    MATRIX_REQUIRED_FEATURES.GENE_NAME: (MATRIX_AXES.FEATURES.value, row_attrs["Gene"]),
    MATRIX_OPTIONAL_FEATURES.ROUND.value: (MATRIX_AXES.FEATURES.value, row_attrs["Hybridization"]),
}

matrix = starspace.Matrix.from_expression_data(
    data=data, coords=coords, dims=dims, name="matrix", attrs=attributes
)
matrix.save_zarr(url=url)
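
The `coords` dictionary above attaches per-region and per-gene metadata to the axes of the matrix. The sketch below shows the same pattern with a plain `xarray.DataArray`; the dimension names `"regions"` and `"features"` and all values are assumptions made for illustration, and this is not how `Matrix.from_expression_data` is necessarily implemented.

```python
import numpy as np
import xarray as xr

# Toy 2 regions x 3 genes count matrix (invented values).
data = np.array([[2, 0, 1], [0, 3, 0]])

# Each coordinate is a (dimension name, values) pair, so region metadata
# lines up with rows and gene metadata lines up with columns.
toy_matrix = xr.DataArray(
    data,
    dims=("regions", "features"),
    coords={
        "region_id": ("regions", [10, 11]),
        "x_region": ("regions", [5.0, 7.5]),
        "gene_name": ("features", ["A", "B", "C"]),
    },
    name="matrix",
)

# Select the row for region 11 using its metadata instead of its position.
keep = np.flatnonzero(toy_matrix["region_id"].values == 11)
row = toy_matrix.isel(regions=keep)
```

The selected row keeps its coordinates, so the gene names and region position travel with the counts.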


Analysis Examples

The following notebooks demonstrate how data formatted in the starspace schema can be read into Python, visualized, and analyzed.

  1. codex.ipynb
  2. osmFISH.ipynb
  3. spatial_transcriptomics.ipynb

Contributing

This package is only as useful as the data that exist in the format it specifies. We eagerly encourage contribution of datasets, and would be happy to work to evolve the schema.

To contribute to the schema or library, or to add formatted data, please begin by opening an issue to discuss the proposed contribution. For data additions, the contribution process is simple: add a conversion example to /conversion_examples that reads the data from a publicly accessible repository. We will run the script, upload the data to the starspace Amazon S3 bucket, and add the contributed data to starspace.data. Optionally, you can add a notebook to /analysis_examples demonstrating how the data can be navigated and showing off the cool features of your spatial dataset!