Analysis Configuration
======================

This framework defines physics analyses entirely through YAML configuration files.
An analysis is composed of one or *pipelines*, each of which is comprised of one or more *modules*.
In addition to the core analysis logic, we must additionally define the *datasets* which are processed and the *executor* used to process them.


Overview
--------

A configuration file is divided into the following logical sections:

- (Optional) Common module blocks (anchors)
- Analysis pipelines
- Dataset-to-pipeline mapping
- Execution backends
- Other configuration options


Prebuilt Modules
-----------------

A full list of available modules and their configuration parameters can be found in the :doc:`builtin_modules` documentation.


A Note on YAML Anchors and Reuse
-----------------------------------

Common analysis logic can be factored into reusable blocks using YAML anchors:

.. code-block:: yaml

   common_cleanup: &common_cleanup
     - module_name: GoldenLumi
     - module_name: VetoMap

Anchors allow sharing logic across multiple pipelines without duplication.
They are expanded inline at runtime.
Note that these are purely a convience feature and can be ignored if you wish.

Simple Example
---------------

The following snippet shows a complete (though very uninteresting) analysis.

.. code-block:: yaml

    analyzer:
      default_run_builder:
        strategy_name: NoSystematics
      simple_pipeline:
        - module_name: GoldenLumi
        - module_name: VetoMap
          input_col: Jet
        - module_name: NoiseFilter
        - module_name: VetoMapFilter
          input_col: Jet
          output_col: Jet
        - module_name: SelectOnColumns
          sel_name: pre_selection
        - module_name: JetFilter
          input_col: FatJet
          output_col: GoodFatJet
          min_pt: 200
          max_abs_eta: 2.4
    
    
    event_collections:
      - dataset: 'data_JetHT_2018'
        pipelines: [simple_pipeline]
      
    extra_executors:
      test:
        executor_name: ImmediateExecutor
        chunk_size: 10000

    location_priorities: [".*(T0|T1|T2).*","eos"]

There are 4 top levels headings:

- The configuration of the analyzer itself.
- A mapping between datasets and the pipelines they should be processed with.
- A list of additional excutors to make available.
- A list of locations to prioritize when retrieving remote files through xrootd.


The most important item is the analyzer.
The ``default_run_builder`` parameter defines the default strategy to use for systematics.
All other subheadings are the names of pipelines followed by the modules that make them up.
For example, in this analysis we define a single pipeline called ``simple_pipeline`` which contains 5 modules that perform basic cleaning on the events.
The heart of an analysis is the modules that make up the pipelines, and understanding how they work is key.
Even in this simple example, we can identify a number of important features, that generalize even to complex analyses with potentially dozens of modules:

- The name of module to be used is always given by the property ``module_name``.
- Modules are configurable, all properties other than ``module_name`` are configuration parameters.
- Many modules have configurable input and/or output columns.
- The act of creating a boolean mask and the act of applying the selection are separate. In this example, the first three modules create boolean filters that are then applied only when the ``SelectOnColumns`` module is run.