Analysis Configuration#

This framework defines physics analyses entirely through YAML configuration files. An analysis is composed of one or more pipelines, each of which comprises one or more modules. In addition to the core analysis logic, a configuration must also define the datasets to be processed and the executor used to process them.

Overview#

A configuration file is divided into the following logical sections:

  • (Optional) Common module blocks (anchors)

  • Analysis pipelines

  • Dataset-to-pipeline mapping

  • Execution backends

  • Other configuration options

Prebuilt Modules#

A full list of available modules and their configuration parameters can be found in the Built-in Modules documentation.

A Note on YAML Anchors and Reuse#

Common analysis logic can be factored into reusable blocks using YAML anchors:

common_cleanup: &common_cleanup
  - module_name: GoldenLumi
  - module_name: VetoMap

Anchors allow logic to be shared across multiple pipelines without duplication. They are expanded inline at runtime. Note that anchors are purely a convenience feature and can be ignored if you wish.
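As a sketch of how an anchored block might be reused (the pipeline name cleanup_only_pipeline here is hypothetical, not a framework requirement), a pipeline can reference the anchored sequence wholesale via a YAML alias:

```yaml
common_cleanup: &common_cleanup
  - module_name: GoldenLumi
  - module_name: VetoMap

analyzer:
  # Reuse the whole anchored module list via an alias.
  cleanup_only_pipeline: *common_cleanup
```

Note that standard YAML substitutes an aliased node as-is, and merge keys (`<<:`) only work for mappings; if you want to reuse individual modules rather than a whole list, anchor each module entry separately.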

Simple Example#

The following snippet shows a complete (though very uninteresting) analysis.

analyzer:
  default_run_builder:
    strategy_name: NoSystematics
  simple_pipeline:
    - module_name: GoldenLumi
    - module_name: VetoMap
      input_col: Jet
    - module_name: NoiseFilter
    - module_name: VetoMapFilter
      input_col: Jet
      output_col: Jet
    - module_name: SelectOnColumns
      sel_name: pre_selection
    - module_name: JetFilter
      input_col: FatJet
      output_col: GoodFatJet
      min_pt: 200
      max_abs_eta: 2.4


event_collections:
  - dataset: 'data_JetHT_2018'
    pipelines: [simple_pipeline]

extra_executors:
  test:
    executor_name: ImmediateExecutor
    chunk_size: 10000

location_priorities: [".*(T0|T1|T2).*","eos"]

There are four top-level headings:

  • The configuration of the analyzer itself.

  • A mapping between datasets and the pipelines they should be processed with.

  • A list of additional executors to make available.

  • A list of locations to prioritize when retrieving remote files through xrootd.

The most important item is the analyzer. The default_run_builder parameter defines the default strategy to use for systematics. All other subheadings are the names of pipelines, followed by the modules that make them up. For example, this analysis defines a single pipeline called simple_pipeline, which contains six modules that perform basic cleaning on the events. The heart of an analysis is the modules that make up its pipelines, and understanding how they work is key. Even in this simple example, we can identify a number of important features that generalize to complex analyses with potentially dozens of modules:

  • The name of the module to be used is always given by the module_name property.

  • Modules are configurable: all properties other than module_name are configuration parameters.

  • Many modules have configurable input and/or output columns.

  • The act of creating a boolean mask and the act of applying the selection are separate. In this example, the first three modules create boolean filters that are then applied only when the SelectOnColumns module is run.
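The last point can be seen in isolation in the following fragment, built from the modules in the example above (the comments describe the behavior stated in the surrounding text, not independently verified module internals): the filter modules only record boolean mask columns, and nothing is removed from the events until SelectOnColumns runs.

```yaml
- module_name: NoiseFilter      # writes a boolean mask column; events are untouched
- module_name: VetoMapFilter    # likewise writes a boolean mask column
  input_col: Jet
  output_col: Jet
- module_name: SelectOnColumns  # only here is the accumulated selection applied
  sel_name: pre_selection
```

This separation means several masks can be defined, combined, or reused before a single selection step, rather than each module cutting events independently.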