Method files
Description
The method file contains the fitting methods to be used during the fitting process. This is where you can:
- define which parameters are fit, fixed, or constrained,
- select the profiles to include in the calculation,
- activate additional calculations, such as grid search, Monte Carlo or bootstrap analyses.
This file is provided to ChemEx using the -m or --method option.
The method file is structured in sections. Each section corresponds to a fitting step. The section name is the name of the fitting step. The steps are run in the order they are defined.
If the fitting method contains multiple fitting steps, the value and behavior of each parameter always inherit from the previous fitting step if not set in the current step. This means that if a parameter is fixed in one step, it remains fixed in the following steps as long as its state is not changed.
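The inheritance rule can be pictured with a short Python sketch. This is not ChemEx code: the `resolve_status` helper, the status labels and the dictionary layout are assumptions made purely for illustration.

```python
# Hypothetical illustration of how a parameter's status carries over between steps.
# The step layout mirrors a method file; none of these names are ChemEx internals.

def resolve_status(steps, defaults):
    """Return the effective status of every parameter after each step."""
    current = dict(defaults)                    # statuses before the first step
    history = {}
    for step_name, settings in steps.items():   # dicts preserve definition order
        for name in settings.get("FIX", []):
            current[name] = "fixed"
        for name in settings.get("FIT", []):
            current[name] = "fit"
        history[step_name] = dict(current)      # snapshot after this step
    return history


steps = {
    "STEP1": {"FIX": ["PB", "KEX_AB"]},
    "STEP2": {"FIT": ["DW_AB"]},                # PB and KEX_AB are not mentioned here
    "STEP3": {"FIT": ["PB", "KEX_AB"]},
}
defaults = {"PB": "fit", "KEX_AB": "fit", "DW_AB": "fixed"}

history = resolve_status(steps, defaults)
for step_name, statuses in history.items():
    print(step_name, statuses)
```

In this toy run, PB and KEX_AB stay fixed in STEP2 because that step does not mention them, and they only become variable again in STEP3.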
The set of residues to be included in global fits should be chosen carefully. A
commonly used multi-step fitting strategy is to select a subset of residues with
relatively large CPMG dispersion or good quality CEST minor dips to first get
global parameters (pB, kex), and then carry out
single-residue fits with (pB, kex) fixed to get
residue-specific parameters (e.g., Δϖ) in the next step. In CPMG experiments, to
get reasonable initial estimates of Δϖ for each residue, an additional
single-residue fitting step can be carried out at the very beginning (see the
method file for CPMG_CH3_1H_SQ/ under Examples/Experiments/ for such an
example).
Example
Here is an example method file demonstrating the different possibilities:
[STEP1]
INCLUDE = [15, 31, 33, 34, 37]
GRID = [
"[KEX_AB] = log(100.0, 600.0, 10)",
"[PB] = log(0.03, 0.15, 10)",
"[DW_AB] = lin(0.0, 10.0, 5)",
]
[STEP2]
FIT = ["PB", "KEX_AB", "DW_AB"]
STATISTICS = { "MC"=100, "BS"=100, "BSN"=100 }
[STEP3]
INCLUDE = "ALL"
FIX = ["PB", "KEX_AB"]
GRID = ["[DW_AB] = lin(0.0, 10.0, 20)"]
[STEP4]
FIT = ["DW_AB"]
This fit contains 4 distinct steps:
- A subset of profiles is selected, and a grid search is performed on the parameters "KEX_AB", "PB" and "DW_AB".
- The parameters "KEX_AB", "PB" and "DW_AB" are set to vary. The selection of profiles remains the same as in the first step.
- All profiles are included in the fit, the parameters "KEX_AB" and "PB" are fixed, and a grid search is performed on the parameter "DW_AB".
- The parameter "DW_AB" is set to vary.
Results are written to separate directories, each named after the corresponding section.
Setting parameter behavior
All model parameters can either be varied, fixed or constrained. The status of each parameter is defined by default at the beginning of the fitting process. However, you can change this behavior in each fitting step of the method file.
FIT
Parameters in the FIT list are varied during the fitting process. Here,
parameter names can either designate a unique parameter or a group of
parameters. For the latter, simply mention the attributes identifying the group.
For example:
FIT = [
"R2_A, NUC->G23N, B0->800.13MHz, T->23C",
"R1_A, B0->800.13MHz",
"DW_AB, NUC->N",
"R2_B",
]
In this example:
- "R2_A, NUC->G23N, B0->800.13MHz, T->23C" corresponds to the amide nitrogen R2 of state A of Gly23 measured at 23 °C and 800.13 MHz.
- "R1_A, B0->800.13MHz" selects all the state A R1 values measured at 800.13 MHz, independently of the residue number and temperature.
- "DW_AB, NUC->N" corresponds to all the amide nitrogen chemical shift differences between states A and B.
- "R2_B" selects all R2 values of state B.
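The way a name plus optional attributes narrows a group of parameters can be sketched in Python. The record layout, the `parse_selector` and `matches` helpers, and the rule used to recognize a nucleus type from a spin name are all hypothetical; ChemEx's own matching logic may differ.

```python
import re


def parse_selector(selector):
    """Split 'R2_A, NUC->G23N, B0->800.13MHz' into a name and attribute filters."""
    name, *filters = [part.strip() for part in selector.split(",")]
    attributes = {}
    for item in filters:
        key, value = item.split("->")
        attributes[key.strip()] = value.strip()
    return name, attributes


def matches(selector, parameter):
    """Return True if the (hypothetical) parameter record is selected."""
    name, attributes = parse_selector(selector)
    if parameter["name"] != name:
        return False
    for key, wanted in attributes.items():
        actual = parameter.get(key)
        if key == "NUC":
            # Assumption: 'N' matches any spin whose atom is N (e.g. 'G23N').
            atom = re.sub(r"^[A-Z]\d+", "", actual or "")
            if wanted not in (actual, atom):
                return False
        elif actual != wanted:
            return False
    return True


# Made-up parameter records used only to exercise the selectors.
parameters = [
    {"name": "R2_A", "NUC": "G23N", "B0": "800.13MHz", "T": "23C"},
    {"name": "R2_A", "NUC": "G23N", "B0": "600.25MHz", "T": "23C"},
    {"name": "R1_A", "NUC": "A45N", "B0": "800.13MHz", "T": "30C"},
    {"name": "DW_AB", "NUC": "A45N"},
    {"name": "R2_B", "NUC": "A45N", "B0": "800.13MHz", "T": "30C"},
]

selectors = [
    "R2_A, NUC->G23N, B0->800.13MHz, T->23C",
    "R1_A, B0->800.13MHz",
    "DW_AB, NUC->N",
    "R2_B",
]
for selector in selectors:
    selected = [p for p in parameters if matches(selector, p)]
    print(selector, "->", len(selected), "parameter(s)")
```

The fewer attributes a selector carries, the larger the group it designates: "R2_B" alone matches every R2_B record regardless of nucleus, field or temperature.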
FIX
Parameters in the FIX list are fixed during the fitting process. The format is
similar to the FIT list.
FIX = [
"R2_A, NUC->G23N, B0->800.13MHz, T->23C",
"R1_A, B0->800.13MHz",
"DW_AB, NUC->N",
"R2_B",
]
CONSTRAINTS
The CONSTRAINTS list defines the constraints to be applied to the parameters.
Constraints are mathematical expressions of other parameters. The value of the
constrained parameter is calculated using this expression.
Parameters in the mathematical expression given in the CONSTRAINTS list should
be put in brackets.
CONSTRAINTS = [
"[R1_B] = 0.5 * [R1_A]",
"[R2_B, NUC->N] = [R2_A, NUC->N]",
]
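To make the bracket notation concrete, here is a minimal Python sketch that evaluates such expressions against a dictionary of current values. The `apply_constraint` helper and the use of the full bracket content as a dictionary key are assumptions for illustration; ChemEx resolves parameter groups and evaluates expressions with its own machinery.

```python
import re


def apply_constraint(constraint, values):
    """Evaluate '[TARGET] = expression' using the current parameter values."""
    target, expression = constraint.split("=", 1)
    target = target.strip().strip("[]")
    # Replace every bracketed parameter name with its numerical value.
    substituted = re.sub(
        r"\[([^\]]+)\]",
        lambda match: repr(values[match.group(1).strip()]),
        expression,
    )
    # Restricted eval: plain arithmetic only, no builtins (a sketch, not production code).
    values[target] = eval(substituted.strip(), {"__builtins__": {}}, {})
    return values


# Made-up starting values; keys are the exact bracket contents.
values = {"R1_A": 1.5, "R2_A, NUC->N": 10.0}
apply_constraint("[R1_B] = 0.5 * [R1_A]", values)
apply_constraint("[R2_B, NUC->N] = [R2_A, NUC->N]", values)
print(values["R1_B"])          # 0.75
print(values["R2_B, NUC->N"])  # 10.0
```

The constrained parameter is never varied on its own: its value is recomputed from the parameters on the right-hand side.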
Keys are read in the following order: CONSTRAINTS -> FIX -> FIT.
Selecting a subset of profiles
INCLUDE
The INCLUDE key in the method file allows selecting a subset of residues for
analysis during each fitting step. The residue name should match the spin-system
assignment provided in the experiment file(s). You can use the full spin-system
name (e.g. "G23N-H"), the group name (e.g. "G23") or the residue
number (e.g. 23). "ALL" (or "*") is the default value, which indicates
that all residues are to be included in the current fitting step.
When only the residue number is used, use a list of integers, that is, omit the quotes. These two formulations are equivalent:
INCLUDE = ["G2", "A4", "C5", "H6"]
INCLUDE = [2, 4, 5, 6]
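The three accepted forms (full spin-system name, group name, residue number) can be illustrated with a small Python sketch. The `is_included` helper and the regular expressions used to extract the group and residue number are assumptions, not the parsing rules ChemEx actually applies.

```python
import re


def is_included(spin_system, include):
    """Check a spin-system name such as 'G23N-H' against an INCLUDE value."""
    if include in ("ALL", "*"):
        return True
    group = re.match(r"[A-Za-z]*\d+", spin_system).group(0)  # 'G23N-H' -> 'G23'
    number = int(re.search(r"\d+", group).group(0))          # 'G23' -> 23
    for entry in include:
        if isinstance(entry, int):
            if entry == number:
                return True
        elif entry in (spin_system, group):
            return True
    return False


spin_systems = ["G2N-H", "A4N-H", "C5N-H", "H6N-H", "G23N-H"]
by_name = [s for s in spin_systems if is_included(s, ["G2", "A4", "C5", "H6"])]
by_number = [s for s in spin_systems if is_included(s, [2, 4, 5, 6])]
print(by_name == by_number)  # True
```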
EXCLUDE
The EXCLUDE key in the method file allows excluding a subset of residues from
analysis during each fitting step. Its usage is similar to that of the INCLUDE key.
Running a grid search
ChemEx has a built-in grid search method that offers the possibility to run an
nD grid search and plot the resulting χ2 values as 1D and 2D plots.
A grid search can be defined and run using the key GRID in any section of the
method file.
The grid is defined on a per-parameter basis. The parameters defining the nD grid
are fixed to the values of the grid point being evaluated, while the other parameters
are set as defined by the FIX, FIT and CONSTRAINTS options. Points of the
grid can be defined using a linear scale, a log scale or point by point:
- Linear scale: "[PB] = lin(<min>, <max>, <nb of points>)"
- Log scale: "[PB] = log(<min>, <max>, <nb of points>)"
- Point by point: "[PB] = (<value1>, <value2>, ..., <valuen>)"
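As a rough guide to what the first two forms produce, the following Python sketch generates the grid points. It assumes that both end points are included and that log means points evenly spaced on a logarithmic scale; the `lin` and `log` functions here are stand-ins written for this example, not the ChemEx implementation.

```python
import math


def lin(start, stop, num):
    """Evenly spaced points from start to stop (both included)."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def log(start, stop, num):
    """Points evenly spaced on a logarithmic scale (both ends included)."""
    exponents = lin(math.log10(start), math.log10(stop), num)
    return [10.0 ** exponent for exponent in exponents]


print(lin(0.0, 10.0, 5))  # [0.0, 2.5, 5.0, 7.5, 10.0]
print([round(value, 1) for value in log(100.0, 600.0, 10)])
```

A log scale puts more points near the lower bound, which is usually what is wanted for parameters such as PB or KEX_AB that can span a wide range.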
Example:
GRID = [
"[PB] = log(0.03, 0.1, 10)",
"[KEX_AB] = log(200.0, 1000.0, 10)",
"[DW_AB] = lin(0.0, 10.0, 10)",
]
Parameters are selected as usual. For example, [R2_A] would select all the
R2_A parameters, while [R2_A, NUC->G43N] would select all the R2_A parameters of the
nucleus G43N (multiple values if there are several temperatures or B0 fields).
At the end of the grid search the best point is selected and the corresponding
parameters are used in the next step of the fitting procedure. The χ2
values are reported in the grid.toml file as well as in the form of 1D and 2D
plots.
For an nD grid with n > 1, multiple 1D grids are plotted, one per parameter, showing, for each parameter value on the x axis, the minimum χ2 value found along the other dimensions.
Similarly, if n > 2, a series of 2D χ2 surfaces, one for each pair of independent parameters, is plotted. Each 2D surface is a 2D projection in which the best-fit values in the other dimensions are used to evaluate each point.
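The projection described here amounts to taking, for each value of one parameter, the minimum χ2 over all the other grid dimensions. A minimal Python sketch follows, using a toy 2D grid and a made-up χ2 function; `project_1d` is a hypothetical helper, not a ChemEx function.

```python
from itertools import product


def project_1d(chi2_grid, axis):
    """For each value along `axis`, keep the lowest χ2 over all other axes."""
    profile = {}
    for point, chi2 in chi2_grid.items():
        value = point[axis]
        profile[value] = min(chi2, profile.get(value, float("inf")))
    return profile


# Toy 2D grid over (PB, KEX_AB) with a made-up χ2 function.
pb_values = [0.03, 0.05, 0.07]
kex_values = [200.0, 400.0, 600.0]
chi2_grid = {
    (pb, kex): ((pb - 0.05) * 100) ** 2 + ((kex - 400.0) / 100) ** 2
    for pb, kex in product(pb_values, kex_values)
}

print(project_1d(chi2_grid, axis=0))  # χ2 versus PB
print(project_1d(chi2_grid, axis=1))  # χ2 versus KEX_AB
```

The same idea extends to 2D surfaces by keeping two coordinates of each grid point instead of one.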
When two parameters are independent of each other, the corresponding χ2 surface is entirely flat. It then becomes possible to define sub-grids of parameters that all depend on each other. The algorithm used in ChemEx starts by defining these minimal individual grids and then evaluates them separately. The evaluated sub-grids may share common parameters. They are therefore combined at the end to recover the global minimum and to plot the 1D χ2 curve for each parameter, and 2D plots when possible.
This algorithm is much faster: a search involving 2 global parameters and 8 independent residue-specific parameters does not generate a 10-dimensional grid, but 8 three-dimensional grids (each residue-specific parameter combined with the 2 global parameters), while retaining all the information.
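The gain can be estimated with a few lines of Python. The count below assumes that every axis has the same number of points and that each sub-grid spans the global parameters plus a single residue-specific parameter; `grid_sizes` is a hypothetical helper written for this estimate.

```python
def grid_sizes(points_per_axis, n_global, n_specific):
    """Compare the full grid with one sub-grid per residue-specific parameter."""
    full = points_per_axis ** (n_global + n_specific)
    # Assumption: each sub-grid spans the global parameters plus one specific one.
    split = n_specific * points_per_axis ** (n_global + 1)
    return full, split


full, split = grid_sizes(points_per_axis=10, n_global=2, n_specific=8)
print(full)   # 10000000000
print(split)  # 8000
```

With 10 points per axis, the number of χ2 evaluations drops from ten billion to a few thousand under these assumptions.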
Evaluating the uncertainty on the fitted parameters
The uncertainty on fitted parameters is, in general, estimated through the covariance matrix obtained from the Levenberg-Marquardt optimization. However, ChemEx offers additional methods to evaluate the parameter uncertainties, that is, Monte Carlo simulation, bootstrap analysis and nucleus-specific bootstrap.
Monte Carlo simulations
For the Monte Carlo simulation, the fit is run once, and synthetic profiles are generated by adding Gaussian noise, scaled by the experimental error, to the back-calculated values. Fits are subsequently run on these generated profiles. After N simulations, the distribution of the fitted parameters provides an estimate of their uncertainty.
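The generation of one synthetic profile can be sketched in Python as follows. The numbers are made up, `monte_carlo_profile` is a hypothetical helper, and the fit that would be rerun on each synthetic profile is deliberately left out.

```python
import random


def monte_carlo_profile(calculated, errors, rng):
    """Add Gaussian noise, scaled by each point's error, to back-calculated values."""
    return [rng.gauss(value, error) for value, error in zip(calculated, errors)]


rng = random.Random(42)              # fixed seed so the sketch is reproducible
calculated = [12.1, 10.4, 9.3, 8.8]  # made-up back-calculated values
errors = [0.2, 0.2, 0.3, 0.3]        # made-up experimental uncertainties

synthetic_profiles = [monte_carlo_profile(calculated, errors, rng) for _ in range(100)]
print(len(synthetic_profiles), len(synthetic_profiles[0]))  # 100 4
```

Each synthetic profile would then be fitted, and the spread of the resulting parameter values taken as the uncertainty estimate.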
Bootstrap analysis
The bootstrap analysis is similar to the Monte Carlo simulations, except that the synthetic profiles are generated by randomly picking data points, with replacement, from each profile, so that each new profile has the same number of points as the original.
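Resampling a single profile can be sketched in Python as below. The tuple layout of a data point and the `bootstrap_profile` helper are assumptions for this example; drawing with replacement is the standard bootstrap convention.

```python
import random


def bootstrap_profile(points, rng):
    """Draw len(points) data points at random, with replacement."""
    return [rng.choice(points) for _ in points]


rng = random.Random(0)
# Made-up profile: (CPMG frequency in Hz, intensity, error) for each data point.
profile = [(50.0, 12.1, 0.2), (100.0, 10.4, 0.2), (200.0, 9.3, 0.3), (400.0, 8.8, 0.3)]

resampled = bootstrap_profile(profile, rng)
print(len(resampled) == len(profile))                # True
print(all(point in profile for point in resampled))  # True
```

Because points are drawn with replacement, some appear several times in a resampled profile while others are left out.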
Nucleus-specific bootstrap analysis
For the nucleus-specific bootstrap analysis, full profiles are randomly selected based on their associated nucleus to generate the synthetic datasets. In other words, if we have a dataset that depends on the nuclei {G2N, H8N, R9N, R9H}, potential new datasets could include the profiles of the following sets of nuclei {H8N, H8N, R9N, R9H} or {G2N, G2N, G2N, R9H}.
Contrary to standard bootstrap analysis, nucleus-specific bootstrap analysis can produce datasets with different numbers of data points. For example, if profiles depending on H8N appear in multiple experiments and the ones depending on G2N in only one, then the two datasets mentioned above would have different numbers of data points. This goes against the main principle underlying bootstrap analysis, which normally requires all newly sampled datasets to be of the same size.
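The following Python sketch illustrates why the dataset size can vary. The profile lengths and the `bootstrap_nuclei` helper are made up for illustration: H8N is assumed to appear in two experiments and the other nuclei in one.

```python
import random


def bootstrap_nuclei(profiles_by_nucleus, rng):
    """Draw as many nuclei as in the original set, with replacement."""
    nuclei = list(profiles_by_nucleus)
    drawn = [rng.choice(nuclei) for _ in nuclei]
    # Every profile attached to a drawn nucleus joins the synthetic dataset.
    dataset = [profile for nucleus in drawn for profile in profiles_by_nucleus[nucleus]]
    return drawn, dataset


# Made-up layout: H8N was measured in two experiments, the others in one.
profiles_by_nucleus = {
    "G2N": [[1.0] * 10],
    "H8N": [[1.0] * 10, [1.0] * 12],
    "R9N": [[1.0] * 10],
    "R9H": [[1.0] * 10],
}

rng = random.Random(1)
for _ in range(3):
    drawn, dataset = bootstrap_nuclei(profiles_by_nucleus, rng)
    n_points = sum(len(profile) for profile in dataset)
    print(drawn, n_points)
```

Each time H8N is drawn it contributes 22 points instead of 10, so the total number of data points depends on which nuclei happen to be selected.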
Syntax
These calculations can be run at the end of any fitting step by using the key
STATISTICS.
The syntax is the following:
[STEP1]
STATISTICS = {"MC"= 100}
where MC is the type of simulation and 100 is the number of simulations.
Types can be:
- "MC" for Monte Carlo
- "BS" for bootstrap
- "BSN" for nucleus-specific bootstrap
To run two or more types of simulation just add additional pairs of values:
[STEP1]
STATISTICS = {"MC"= 100, "BS"= 100}
The output for each kind of simulation is written to a single file in the directory corresponding to the step it belongs to. Parameter values are stored in separate columns. When no values are available, which can be the case for nucleus-specific bootstrap analysis, the characters "--" are used to fill the space.