Automated Model Building

This is the main module of ARP/wARP. It carries out the following tasks:

(a) automated model building starting from experimental phases

(b) automated model building starting from an existing model

(c) improvement of maps by atom update and refinement

(d) building of solvent atoms

Applications (a) and (b), the so-called warpNtrace protocol, start from input experimental or density-modified phases, or from an available (preliminarily refined or partially autotraced) model. They aim to deliver an essentially complete model and, consequently, an improved map. The software used by these applications has advanced considerably since the previous version 6.0: the task now converges faster, may be applicable to lower-resolution X-ray data and may tolerate poorer starting phases. As a rule of thumb, the resolution of the data should be 2.7 Å or better.

The warpNtrace protocol uses the idea of a hybrid model in which protein and free atoms co-exist. warpNtrace keeps whatever was recognised as protein (in the form of polypeptide fragments), treats the rest as free atoms and refines this hybrid model during a 'big' cycle consisting of several (typically 5) ARP/REFMAC update/refinement cycles. At the end of each big cycle the map is interpreted anew, which is expected to give a better interpretation (more residues in fewer fragments). The whole procedure is iterated (typically 10 times).

The output of warpNtrace is a set of refined polypeptide fragments. If the sequence is available, the traced fragments are docked into the sequence and side chains are built during the iterative refinement procedure. After the last building cycle the fragments are arranged to form a globular structure (or, in the case of NCS, several NCS-related structures). The remainder of the structure (cis-prolines, poorly ordered loops and terminal residues of each fragment) has to be completed by the user. Since the output model is refined, its accuracy is expected to be comparable to that of the final refined structure. Mis-tracing (incorrect tracing of polypeptide fragments) is possible but should not normally exceed 1 % of the whole structure; this depends strongly on the resolution and quality of the data, the quality of the starting phases and the level of convergence of the warpNtrace task.

Application (c) has not changed since the previous 6.0 release. It can be used if warpNtrace was unsuccessful and may improve the density map. The map is first interpreted as a pseudo-protein model consisting of unconnected free atoms (similar to the map interpretation in application (a)). This model is then refined and updated in iterative cycles of ARP/REFMAC. However, no autotracing (interpretation of the map in terms of polypeptide fragments, as in warpNtrace) is carried out.

Application (d), building a solvent structure into a model in which the protein part is complete, has also not changed since the previous 6.0 release. Within this task restrained reciprocal-space refinement is carried out with REFMAC while ARP/wARP automatically adjusts the solvent structure. The resolution of the data should be 2.5 Å or better. The output is the protein model with the solvent molecules transformed by symmetry operations to lie around the protein.

Application (a) is described in detail below. The input to applications (b), (c) and (d) is very similar and should be obvious.

 

 

o               Run ARP/wARP for: Choose one of the applications (a) to (d) described above.

o               Dock the autotraced chains to sequence: The default is to dock the fragments starting from building cycle 0. The cycle number can be changed, although this is unlikely to be advantageous. If the sequence is not available, docking can be disabled by clicking the check box on the left.

o               MTZ in: X-ray data in MTZ format containing structure factor amplitudes, their standard deviations, phases and figures of merit. If pre-weighted structure factor amplitudes (e.g. from SHARP) are to be used to construct the initial map, please check the corresponding box in the ARP/wARP flow parameters (see below).

o               Fobs Sigma PHIB FOM: If the MTZ column labels for structure factor amplitudes, their standard deviations, phases and figures of merit have obvious names, they will be recognised automatically. Otherwise please use the scrolling button, navigate to List All Labels and choose the appropriate ones.

o               Sequence file in: Provide the sequence file in the following format (pir):

The first line should start with >

The second line should be blank

The sequence (one-letter code) starts on the third line. Spaces from there on are ignored.

o               Total residues in the AU / number of molecules: For a monomer, provide the total number of residues in the asymmetric unit (AU); the number of molecules is then 1. In the case of NCS, provide the total (!) number of residues in the asymmetric unit together with the number of NCS-related molecules (e.g. if you have 2 molecules in the AU with 200 residues each, enter 400 for the number of residues). If you have a heteromer, e.g. a 3a/3b structure, the NCS order is 3, but please make sure that the sequence file contains both sequences separated by about 20 alanines:

SEQUENCE_OF_a_SUBUNIT_AAAAAAAAAAAAAAAAAAAA_SEQUENCE_OF_b_SUBUNIT

o               Cycles of autobuilding / total cycles: The default is 10 big building cycles, each followed by 5 ARP/REFMAC cycles (50 cycles in total). With good starting phases the autobuilding may converge faster; with poorer phases more cycles may be required. You can always submit warpNtrace for further cycles using the output of the previous tracing (application (b), automated model building starting from an existing model).

o               Protocol for REFMAC5 / Rfree: The fast and slow protocols differ in the number of internal Refmac cycles and in the damping factors. The protocol is set automatically according to the resolution of the X-ray data, and there is usually no need to change it. For the warpNtrace task the default is not to use Rfree, since the number of traced residues serves as an excellent indicator of the success of the job. You can turn the use of Rfree on, but the authors have seen marginal cases (low resolution and hence a low observation-to-parameter ratio) in which this adversely affected the tracing.
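
The sequence file layout described above (a first line starting with >, a blank second line, then the one-letter sequence, with the two subunits of a heteromer joined by about 20 alanines) can be sketched as a small shell helper. This is only an illustration: the function names make_linker and write_heteromer_seq, the title and the example sequences are hypothetical, and the sketch assumes that the subunit sequences are simply concatenated with the alanine linker between them.

```shell
#!/bin/sh
# Sketch only: helper names, title and sequences are hypothetical.

# make_linker N: print N alanines (one-letter code A) without a newline.
make_linker() {
    awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) printf "A" }'
}

# write_heteromer_seq TITLE SEQ_A SEQ_B OUTFILE
# Writes a three-line sequence file: ">TITLE", a blank line, then the
# two subunit sequences joined by a 20-alanine linker (assumed layout).
write_heteromer_seq() {
    linker=$(make_linker 20)
    {
        printf '>%s\n' "$1"                    # first line starts with >
        printf '\n'                            # second line is blank
        printf '%s%s%s\n' "$2" "$linker" "$3"  # sequence from line three
    } > "$4"
}

# Example (hypothetical sequences):
#   write_heteromer_seq my_heteromer MKTLLVG GSHMENR heteromer.pir
```

The linker length of 20 follows the "about 20 alanines" guidance in the text; whether a different length works equally well is not stated there.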

 

 

There are a number of additional parameters that you normally need not worry about. A brief description is given below.

 

o               Pre-weighted Fobs for initial map calculation (e.g. from SHARP): Checking this box brings up a pull-down menu asking for the FBEST label.

o               Number of ARP/REFMAC refinement cycles between autobuilding: The default is 5 cycles. In cases of poor convergence you can try increasing this number to 10.

o               Skip the autobuilding for the first cycles: Checking this box disables autotracing for the given number of cycles. This was sometimes advantageous with the previous version 6.0 when the initial phases were poor. The default is to start autotracing from cycle 0.

o               Randomisation of atomic positions: This was sometimes advantageous with the previous version 6.0 when the initial bias was high. The default is not to randomise.

o               Truncate excessive shifts: This is a leftover from an earlier version; ignore this parameter.

o               Removal of protein atoms of traced model: During the ARP/REFMAC cycles between tracings, the hybrid model is updated. If you would like to keep track of which parts of the traced fragments have been removed during the update, check the box. This option is provided primarily for developers.

o               Iterate the tracing: Each main-chain tracing is carried out in several iterations. The module decides on its own how many iterations are needed. The default maximum number is 5 and it is NOT recommended to change this value.

o               Density thresholds for atom removal and addition: These parameters are set automatically on the basis of the resolution of the X-ray data. In cases of poor convergence, particularly when the numbers of both added and removed atoms are considerably lower than the numbers requested (as can be seen from the log file), the threshold for atom removal can be increased slightly. This option is provided primarily for developers.

o               Increase in the number of atoms to be added and removed as compared to the automatically set values: The default is 1 (no increase) and it is not recommended to change this parameter. This option is provided primarily for developers.

 

o               Cycles of refinement in each Refmac run: Refmac is invoked to refine the hybrid model before the density maps are computed. The default is 1 cycle for the fast protocol and 3 cycles for the slow protocol (see above). There is usually no need to change this parameter.

o               Damp shifts: The default is 0.99. There is usually no need to change this parameter.

o               Matrix weight for Xray / Geometry: The default is automatic weighting. This has proved to work well and there is probably no need to change this parameter.

o               Scaling model: The default is to use solvent correction for scaling the low-angle part of the X-ray data. You can turn this off (choose simple solvent correct) if your low-angle data are missing (e.g. your data have a low-resolution cutoff of about 8 Å) or suffer from missing overloaded reflections.

o               Scaling B factor: The default is to use an anisotropic B factor for scaling the X-ray data. You can turn this off (choose isotropic scaling B factor) if your data are systematically incomplete (e.g. a cone is missing in reciprocal space).

o               Data with free R label: This parameter appears if the free R flag is chosen for refinement of the protein part of the model. Here you can provide the column label for the free R flag.

o               Use of free R reflections: This parameter appears if the free R flag is chosen for refinement of the protein part of the model. The scaling and the calculation of sigmaA coefficients for the Refmac map can be based on the free reflections (this is the default) or on all reflections.

o               Solvent mask correction: The default is to use solvent mask correction within Refmac.

o               TLS refinement: The default is not to do a TLS refinement of the hybrid model.

 

o               Space group: This is derived automatically from the MTZ file; it is displayed for information only and cannot be changed.

o               Cell: This is derived automatically from the MTZ file; it is displayed for information only and cannot be changed.

o               Wilson B factor: This is derived automatically from the MTZ file; it is displayed for information only and cannot be changed.

o               Solvent content: This is derived automatically from the MTZ file; it is displayed for information only and cannot be changed. However, you may want to check whether this number conforms to your expectations.

o               Resolution: By default all reflections present in the MTZ file will be used. You can check the box and then narrow the range if you are aware of certain deficiencies in your data.

 

o               Checking this button activates remote submission, which is described below in a separate chapter of this document.

 


  • OUTPUT files, short Log File:

o               Had to go as low as XXX sigma to complete atoms search: The initial free-atom model is built into the starting density map, with the density threshold being reduced successively. A typical value seen in the log file is between 0.3 and 0.6 sigma. A lower value may indicate an over-flattened map or an overestimate of the number of residues in the asymmetric unit. If you suspect the latter, please check the derived solvent content in the GUI window.

o               Building cycle zero: Normally a considerable part of the structure should be built already at building cycle zero. If this is not the case, watch the next few building cycles. If, however, essentially nothing has been autotraced after 10 building cycles, please check whether the initial phases are sufficiently good.

o               Rounds within building cycle: As mentioned above, each cycle of main-chain tracing is carried out in several rounds. Normally each successive round should yield more residues in fewer fragments. The maximum length of a traced fragment is also printed for information.

o               Chains, residues and connectivity index: The output from the best tracing round is processed further. Terminal residues are removed and fragments of 3 peptides or shorter are converted back to free atoms. The rest is kept and used to provide restraints for the subsequent ARP/REFMAC cycles. The connectivity index should increase steadily if the tracing is successful. A value below 0.6 is not very promising, a value around 0.8 indicates good progress, and a value above 0.95 indicates essentially complete tracing.

o               Residues docked into sequence: If the sequence was provided, the autotraced fragments are docked into it and the side chains are built and refined in real space. The results of this are printed out.

o               R factor from Refmac: The value of the R factor typically oscillates. It goes up after each tracing cycle (because the model is entirely rebuilt) and then decreases during the ARP/REFMAC refinement cycles. At the end of the procedure it should reach a value typical of a restrained refinement.

o               Sequence coverage: This is defined as the ratio between the number of docked residues (if a sequence is provided) and the total number of traced residues. A value higher than 0.8 is regarded as good convergence: all free (dummy) atoms are then removed from the file and the task moves into a few cycles of restrained refinement with solvent search. If, however, the sequence coverage is lower than 0.8, the free atoms are left in the file. You can then inspect the density maps, start changing the model on the graphics or, alternatively, submit another warpNtrace task using the output of this job.
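
The rules of thumb above for the connectivity index and the sequence coverage can be written down as a short shell sketch. The function names are hypothetical, and the values are assumed to have been read from the log file by hand. The text only gives the reference points 0.6, 0.8 and 0.95 for the connectivity index, so the exact band boundaries used here (in particular the "intermediate" band between 0.6 and 0.8) are an assumption made for illustration.

```shell
#!/bin/sh
# Sketch only: function names and band boundaries are illustrative.

# classify_connectivity VALUE: map a connectivity index to a verdict.
classify_connectivity() {
    awk -v c="$1" 'BEGIN {
        if (c > 0.95)      print "essentially complete tracing"
        else if (c >= 0.8) print "good progress"
        else if (c >= 0.6) print "intermediate"
        else               print "not very promising"
    }'
}

# sequence_coverage DOCKED TRACED: ratio of docked to traced residues.
sequence_coverage() {
    awk -v d="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "undefined"; exit 1 }
        printf "%.2f\n", d / t
    }'
}

# coverage_converged DOCKED TRACED: succeeds if coverage is above 0.8.
coverage_converged() {
    awk -v d="$1" -v t="$2" 'BEGIN { exit !(t > 0 && d / t > 0.8) }'
}

# Example:
#   classify_connectivity 0.82     # prints "good progress"
#   sequence_coverage 180 200      # prints "0.90"
```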

 

o               CPU requirements: Execution of the autotracing task is time consuming. Using the standard protocol of 10 building cycles interspersed with 5 ARP/REFMAC cycles, one should expect a job for a structure of 200 residues to complete within 1 hour (depending on the power of the computer you are using).

 

o               Job termination

The statement Task completed successfully indicates that the job finished with no error. An error statement

QUITTING PROGRAM TO BLAME: name_of_the_programme

indicates that one of the modules of the task terminated with an error message. You will also be referred to the specific log file.
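
A quick way to check how a job ended is to search the short log file for these two statements. The sketch below assumes that both statements appear verbatim in the log as quoted above and that the log file name is passed as an argument; the function name check_job_log and the return codes are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the statements appear in the log exactly as
# quoted in the text; function name and return codes are illustrative.

# check_job_log LOGFILE
# Returns 0 on success, 1 on a reported error, 2 if neither statement
# is found (for example because the job is still running).
check_job_log() {
    log=$1
    if grep -q 'PROGRAM TO BLAME' "$log"; then
        echo "error: $(grep 'PROGRAM TO BLAME' "$log" | head -n 1)"
        return 1
    elif grep -q 'Task completed successfully' "$log"; then
        echo "finished without error"
        return 0
    else
        echo "no termination statement found"
        return 2
    fi
}
```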

 


Running model building from command line, auto_tracing.sh

The script file auto_tracing.sh in the $warpbin directory allows one to run the automated model building from a command line without the use of the GUI. The use of auto_tracing.sh is fairly simple. If invoked without arguments the script will print help information.

Required keywords are: datafile (followed by the mtz-file name with the absolute path) and residues (followed by the number of residues).

Optional keywords include: workdir (followed by the absolute path to the working directory), fp (followed by the fp label), sigfp (followed by the sigfp label), freelabin (followed by the Rfree label), fbest (followed by the label for the fom-weighted structure factor amplitudes to be used for initial map calculation), phibest (followed by the best phi label), fom (followed by the figure of merit label), modelin (followed by a starting pdb-file with the absolute path), seqin (followed by a sequence-file name with the absolute path), cgr (followed by the number of NCS-related copies), cycles (followed by the total number of cycles) and albe (followed by 1 if building of secondary structural elements is to be invoked before every model building cycle).

Example call (assumed to be started from workdir where test data should reside):

auto_tracing.sh \
     datafile {mtzfile} \
     residues {number_of_residues_in_AU} \
     [workdir {FULLPATH_WORKING_DIRECTORY}] \
     [fp {fp_label}] [sigfp {sigfp_label}] [freelabin {freer_label}] \
     [fbest {weighted_amplitude_label}] [phib {phib_label}] [fom {fom_label}] \
     [modelin {input_PDB_file_to_use_as_initial_model}] \
     [seqin {sequence_file_for_one_NCS_copy}] \
     [cgr {number_of_NCS_copies (if seqin is provided, default is 1) }] \
     [cycles {the_total_number_of_cycles (default is 50) }] \
     [albe {1 to_always_invoke_albe (default is 0 for resol < 2.7A, else 1) }] \
     [parfile {parfilename_if_only_parfile_is_to_be_created}]

The script will then create a directory in the workdir whose name will be printed and where a parameter file will be created. The log files and additional output files as well as the building results can be found in the directory created by auto_tracing.sh.
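
As an illustration of how the keywords fit together, the sketch below assembles a call for the NCS example used earlier (2 molecules of 200 residues each) with the default protocol of 10 building cycles and 5 ARP/REFMAC cycles each. The file paths and variable names are hypothetical, and treating the cycles keyword as the product of the two numbers is an inference from the stated default of 50. The sketch only prints the command; it does not run it.

```shell
#!/bin/sh
# Sketch only: paths and variable names are hypothetical placeholders.

mtz=/data/project/phases.mtz       # absolute path, as datafile requires
seq=/data/project/protein.pir      # sequence file for one NCS copy
residues_per_copy=200
ncs_copies=2
big_cycles=10                      # building cycles
refine_per_big=5                   # ARP/REFMAC cycles per building cycle

# residues takes the total for the asymmetric unit, not per molecule.
residues_in_au=$((residues_per_copy * ncs_copies))

# cycles takes the total number of cycles (assumed: product of the two).
total_cycles=$((big_cycles * refine_per_big))

cmd="auto_tracing.sh datafile $mtz residues $residues_in_au seqin $seq cgr $ncs_copies cycles $total_cycles"
echo "$cmd"
```

If the $warpbin directory is not on the search path, the script would have to be called with its full path, for example "$warpbin/auto_tracing.sh".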

 

Remote submission of model building task

This option offers you the following possibilities:

a) Your task will run using external computational facilities, where the CPU performance may be superior to your local installation.

b) You can be assured that the most recent working executables will be used should you have a problem with your local installation.

c) Should the task crash, an automatic notification will be forwarded to the ARP/wARP developers who can then promptly help you.

d) You can share the results of the completed task with other software developers.

Clicking on the button "Submit the job for remote execution at the Hamburg cluster" within the main ARP/wARP GUI panel allows one to execute an autotracing task remotely. The panel will expand and ask for an email address. Then choose one of the options from the drop-down menu to indicate how you would like your data to be handled. The options are:

a) the data must be kept confidential and deleted after the job has finished

b) the data can be made available to ARP/wARP developers

c) the data can be archived and made available to SPINE and BIOXHIT partners

d) the data can be archived and made available to any software developer that requests them

Needless to say, users make an important contribution to future software development if they decide to share their data and the results of the autotracing job. Option (b) shares the data only with the ARP/wARP development team. Option (c) extends the sharing to SPINE and BIOXHIT partners, and option (d) to any software developer who requests the data.

Once the job has been submitted for remote execution (but not yet launched!), the GUI window will indicate that the job has finished. Please inspect the log file from the drop-down menu option "View files from job" for further instructions. An email will be sent to the address that you entered in the GUI window. Please follow the instructions in the email (http link, login and password) to actually launch the job at the Hamburg cluster. You can then monitor the log file in your browser window. As soon as the job is finished, you will be provided with a link to the results, which you can then download.

Keep in mind that once the job is finished, your data will be kept for only a week. Make sure that you download your data within that time.

The remote job submission relies on the curl software being installed at your site. The availability of curl is checked while installing ARP/wARP, and a warning (with an http link) is given if curl is not available.
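
You can also check for curl yourself before attempting a remote submission. The following is a minimal sketch using the standard shell built-in command -v; the function name report_curl and the messages are hypothetical and are not part of ARP/wARP.

```shell
#!/bin/sh
# Sketch only: function name and messages are illustrative.

# report_curl: print whether curl can be found on the search path.
report_curl() {
    if command -v curl >/dev/null 2>&1; then
        echo "curl found: $(command -v curl)"
    else
        echo "curl not found: remote submission will not be available"
        return 1
    fi
}
```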