.. _tutorial-create:

================================
Create, modify and organize data
================================

To begin, we need some sample data to work with. You may use your own reads
(.fastq) files, or download an example set we have provided:

.. literalinclude:: files/tutorial-create.py
   :lines: 2-13

.. note::

   To avoid copy-pasting the commands, you can
   :download:`download all the code ` used in this section.

Organize resources
==================

Before anything else, we need to prepare a place to work. In our case, this
means creating a "container" where the produced data will reside. So let's
create a collection and then put some data inside!

.. literalinclude:: files/tutorial-create.py
   :lines: 15-16

Upload files
============

We will upload single-end fastq reads with the `upload-fastq-single`_ process.

.. _upload-fastq-single: http://resolwe-bio.readthedocs.io/en/latest/catalog-definitions.html#process-upload-fastq-single

.. literalinclude:: files/tutorial-create.py
   :lines: 18-25

What just happened? First, we chose a process to run, using its slug
``upload-fastq-single``. Each process requires some inputs; in this case
there is only one input, named ``src``, which is the location of the reads on
our computer. Uploading a fastq file creates a new ``Data`` object on the
server containing the uploaded reads. Additionally, we ensured that the new
``Data`` is put inside ``test_collection``.

The upload process also created a ``Sample`` object for the reads data to be
associated with. You can access it by:

.. literalinclude:: files/tutorial-create.py
   :lines: 27

.. note::

   You can also upload your files by providing a URL. Just replace the path
   to your local files with the URL. This comes in handy when your files are
   large and/or are stored on a remote server and you don't want to download
   them to your computer just to upload them to the Resolwe server again.

Modify data
===========

Both the ``Data`` with reads and the ``Sample`` are owned by you, and you have
permission to modify them.
For example:

.. literalinclude:: files/tutorial-create.py
   :lines: 29-31

Note the ``save()`` part! Without it, the change is only applied locally (on
your computer). Calling ``save()`` ensures that all changes are also applied
on the server.

.. note::

   Some fields cannot (and should not) be changed. For example, you
   cannot modify the ``created`` or ``contributor`` fields. You will get an
   error if you try.

Annotate Samples
================

The next thing to do after uploading some data is to annotate the samples this
data belongs to. This is done by assigning a value to a predefined field on a
given sample. See the example below.

Each sample should be assigned a species. This is done by attaching the
``general.species`` field to a sample and assigning it a value, e.g.
``Homo sapiens``.

.. literalinclude:: files/tutorial-create.py
   :lines: 33

Annotation Fields
-----------------

You might be wondering why the example above requires the ``general.species``
string instead of e.g. just ``species``. The answer lies in
``AnnotationField``\ s. These are predefined *objects* that are available to
annotate samples. These objects primarily have a name, but also other
properties. Let's examine some of those:

.. literalinclude:: files/tutorial-create.py
   :lines: 35-42

.. note::

   Each field is uniquely defined by the combination of ``name`` and
   ``group``.

If you wish to examine which fields are available, use a query:

.. literalinclude:: files/tutorial-create.py
   :lines: 44-46

You may be wondering whether you can create your own fields / groups. The
answer is no. Experience has shown that keeping things organized requires a
selected set of predefined fields. If you absolutely feel that you need an
additional annotation field, let us know or use resources such as
:ref:`metadata`.

Annotation Values
-----------------

As mentioned before, fields are only one part of the annotation. The other
part is the annotation values, stored as a standalone resource,
``AnnotationValue``.
They connect the field with the actual value.

.. literalinclude:: files/tutorial-create.py
   :lines: 48-55

As a shortcut, you can get all the ``AnnotationValue``\ s for a given sample
by:

.. literalinclude:: files/tutorial-create.py
   :lines: 57

Helper methods
--------------

Sometimes it is convenient to represent the annotations as a dictionary,
where keys are field names and values are annotation values. You can get all
the annotations for a given sample in this format by calling:

.. literalinclude:: files/tutorial-create.py
   :lines: 58

Multiple annotations stored in a dictionary can be assigned to a sample by:

.. literalinclude:: files/tutorial-create.py
   :lines: 59-62

An annotation is deleted from the sample by setting its value to ``None`` when
calling the ``set_annotation`` or ``set_annotations`` helper methods. To avoid
the confirmation prompt, you can set ``force=True``.

.. literalinclude:: files/tutorial-create.py
   :lines: 63

Run analyses
============

Various bioinformatic processes are available to properly analyze sequencing
data. Many of these pipelines are available via Resolwe SDK, and are listed in
the `Process catalog`_ of the `Resolwe Bioinformatics documentation`_.

.. _Process catalog: http://resolwe-bio.readthedocs.io/en/latest/catalog.html
.. _Resolwe Bioinformatics documentation: http://resolwe-bio.readthedocs.io

After uploading the reads file, the next step is to align the reads to a
genome. We will use the STAR aligner, which is wrapped in a process with slug
``alignment-star``. Inputs and outputs of this process are described in the
`STAR process catalog`_. We will define the input files, and the process will
run its algorithm that transforms inputs into outputs.

.. _STAR process catalog: https://resolwe-bio.readthedocs.io/en/latest/catalog-definitions.html#process-alignment-star

.. literalinclude:: files/tutorial-create.py
   :lines: 67-76

Let's take a closer look at the code above. We defined the alignment process
by its slug ``'alignment-star'``.
For inputs we defined the data objects ``reads`` and ``genome``. The ``reads``
object was created with the ``upload-fastq-single`` process, while the
``genome`` data object was already on the server and we just used its slug to
identify it. The ``alignment-star`` process will automatically take the right
files from the data objects specified in the inputs and create output files:
a ``bam`` alignment file, a ``bai`` index and some more.

You probably noticed that we get the result almost instantly, while a typical
alignment process runs for hours. This is because processing runs
asynchronously, so the returned data object does not have an ``OK`` status or
outputs when returned.

.. literalinclude:: files/tutorial-create.py
   :lines: 78-85

Status ``OK`` indicates that processing has finished successfully, but you
will also find other statuses. They are given as two-letter abbreviations. To
understand their meanings, check the :obj:`status reference `. When
processing is done, all outputs are written to disk and you can inspect them:

.. literalinclude:: files/tutorial-create.py
   :lines: 87-88

Until now, we used the ``run()`` method twice: to upload reads (yes, uploading
files is just a matter of using an upload process) and to run alignment. You
can check the full signature of the :obj:`run() ` method.

Run workflows
=============

Typical data analysis is often a sequence of processes. Raw data or initial
input is analysed by running a process on it that outputs some data. This data
is fed as input into another process that produces another set of outputs.
This output is then again fed into another process, and so on. Sometimes, this
sequence is so commonly used that one wants to simplify its execution. This
can be done by using a so-called "workflow". Workflows are special processes
that run a stack of processes. On the outside, they look exactly the same as a
normal process and have a process slug, inputs, and so on. For example, we can
run the workflow "General RNA-seq pipeline" on our reads:

.. literalinclude:: files/tutorial-create.py
   :lines: 90-100

Solving problems
================

Sometimes the data object will not have an "OK" status. In such a case, it is
helpful to be able to check what went wrong (and where). The :obj:`stdout() `
method on data objects can help: it returns the standard output of the data
object (as a string). The output is long but exceedingly useful for debugging.
Also, you can inspect the info, warning and error logs.

.. literalinclude:: files/tutorial-create.py
   :lines: 104-117
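
Because processing is asynchronous, a common pattern is to poll a data object
until it finishes and only then look at its logs. The following is a minimal
sketch of that pattern, not part of the tutorial code. It assumes a data
object that offers ``update()``, ``status``, ``stdout()`` and the log
attributes ``process_error`` and ``process_warning``. The helper names
``wait_for_data`` and ``describe_failure`` and the ``FakeData`` stand-in are
hypothetical, and the two-letter codes other than ``OK`` are assumptions that
should be checked against the status reference. The stand-in lets the sketch
run without a server connection.

```python
import time

# "OK" is taken from the text above; "ER" is an assumed code for failure.
DONE_STATUS = "OK"
ERROR_STATUS = "ER"


def wait_for_data(data, poll_interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll ``data`` until it reaches OK or ER, or until ``timeout`` passes.

    ``data`` is any object with an ``update()`` method and a ``status``
    attribute. Returns the final status string.
    """
    waited = 0.0
    while True:
        data.update()  # refresh the local copy of the object
        if data.status in (DONE_STATUS, ERROR_STATUS):
            return data.status
        if waited >= timeout:
            raise TimeoutError(f"still {data.status!r} after {waited:.0f} s")
        sleep(poll_interval)
        waited += poll_interval


def describe_failure(data):
    """Collect error and warning messages, falling back to the stdout tail."""
    lines = [f"error: {message}" for message in data.process_error]
    lines += [f"warning: {message}" for message in data.process_warning]
    if not lines:
        # Hypothetical fallback: show only the last few lines of stdout.
        lines = data.stdout().splitlines()[-5:]
    return "\n".join(lines)


class FakeData:
    """Stand-in for a data object so the sketch runs without a server."""

    def __init__(self, statuses, errors=(), warnings=(), stdout_text=""):
        self._statuses = list(statuses)
        self.status = None
        self.process_error = list(errors)
        self.process_warning = list(warnings)
        self._stdout_text = stdout_text

    def update(self):
        # Each call reveals the next status, mimicking server-side progress.
        if self._statuses:
            self.status = self._statuses.pop(0)

    def stdout(self):
        return self._stdout_text


failed = FakeData(["WT", "PR", "ER"], errors=["Genome index not found."])
final_status = wait_for_data(failed, poll_interval=0.0, sleep=lambda s: None)
if final_status != DONE_STATUS:
    print(describe_failure(failed))
```

With a real data object you would pass it in place of ``FakeData`` and keep
the default ``sleep``. The ``timeout`` guard is a design choice that stops a
stuck job from blocking a script indefinitely.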