1) Physical breaks of the genome or sequencing gaps of unknown size are typically encoded by stretches of Ns in the genome assembly FASTA file. Previously we inserted 20 hypothetical genes at each break. This, however, diluted the background of low-quality genomes with non-enzymes, so a cluster was more likely to be classified among the top x% of enzyme-dense regions in a low-quality genome than in a high-quality one. In version 1.3 we identify these breaks (but no longer insert 20 hypothetical genes) and prevent the formation of a cluster across these gaps.

2) Any sequencing information that is missing is typically hard-masked with Ns. Previously, any intergenic region affected by at least one N was evaluated for its length, and hypothetical genes were inserted accordingly (see Schlapfer et al., PMID:28228535). This sometimes unrealistically prevented the detection of gene clusters: it is unlikely that missing information about a single nucleotide would, if it were known, lead to the finding of multiple gene models. Thus we changed the code to insert 2 hypothetical genes only if a stretch of unknown sequence is larger than the nth percentile of gene sequence lengths (n is set to 5). We also provide the option to NOT insert any hypothetical genes at all.

3) Large gene-poor intergenic regions are present in genomes. In this version we provide several parameters to prevent clusters from spanning such large gene-poor regions. By default we use MaxSeqGapSize set to 100000 and MaxInterGeneDistByMedian set to 50, which results in cluster predictions similar to those of PCF version 1.0.

Mandatory inputs: all files need a full path; order does not matter.

pgdb fullpath: Full path of the PGDB flat file folder, where the flat files of the species PGDB are stored.
rmdf fullpath: TSV file of the metabolic domains of reactions. Check whether the file is outdated (it needs to contain all reactions present in the PGDB that have genes annotated to them).
md list: A list of metabolic domains that should be analyzed.
psf fullpath: Protein sequence file of the genome. Can only contain protein IDs as header e.g.
gtpf fullpath: TSV file with a header describing gene, transcript, and protein identifier, followed by the listings for all identifiers (gene-ID, transcript-ID, protein-ID). This is used to map information regarding transcripts (currently not used) and proteins (used for MCL clustering, potentially also used in PGDBs) to the genes (e.g.
glof fullpath: TSV file with gene ID, start bp, end bp, chromosome / scaffold, strand (encoded as 1 or -1).
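
To make the layout of the glof and gtpf inputs concrete, here is a minimal Python sketch of how the two TSV files could be read. It is not part of PCF. The names GeneLocation, read_gene_locations and read_protein_to_gene are hypothetical, the column order is taken from the descriptions in this section, and whether the glof file carries a header row is an assumption exposed as a parameter.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneLocation:
    """One row of the glof file (hypothetical helper type)."""
    gene_id: str
    start: int
    end: int
    seq_region: str  # chromosome or scaffold name
    strand: int      # 1 or -1


def read_gene_locations(handle, has_header=False):
    """Read glof rows: gene ID, start bp, end bp, chromosome/scaffold, strand.

    Whether the file has a header row is an assumption; set has_header to match.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)
    genes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, start, end, region, strand = row[:5]
        strand = int(strand)
        if strand not in (1, -1):
            raise ValueError(f"strand must be 1 or -1 for gene {gene_id}")
        genes[gene_id] = GeneLocation(gene_id, int(start), int(end), region, strand)
    return genes


def read_protein_to_gene(handle):
    """Read gtpf rows (gene-ID, transcript-ID, protein-ID) after one header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # header line describing the three identifiers
    return {row[2]: row[0] for row in reader if len(row) >= 3}


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    glof = io.StringIO("geneA\t100\t900\tChr1\t1\ngeneB\t1500\t2100\tChr1\t-1\n")
    gtpf = io.StringIO("gene\ttranscript\tprotein\ngeneA\tgeneA.t1\tgeneA.p1\n")
    locations = read_gene_locations(glof)
    protein_to_gene = read_protein_to_gene(gtpf)
    print(locations[protein_to_gene["geneA.p1"]])
```

Keying the second mapping by protein ID mirrors the stated purpose of the gtpf file, which is to map protein-level information back to genes.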
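
The rule for stretches of Ns described in changes 1) and 2) can be illustrated with a short Python sketch. It is not the PCF implementation. The function names are hypothetical, coordinates are 0-based and half-open, and the nearest-rank percentile is an assumption, since this section does not say how the percentile is computed.

```python
import math
import re


def find_n_stretches(sequence):
    """Return (start, end) pairs for every run of N/n in the sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def percentile(values, pct):
    """Nearest-rank percentile (assumed method, not confirmed by the README)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def hypothetical_genes_to_insert(stretch_length, gene_lengths, pct=5, n_genes=2):
    """Insert n_genes only if the N stretch exceeds the pct-th percentile
    of gene lengths; otherwise insert none."""
    threshold = percentile(gene_lengths, pct)
    return n_genes if stretch_length > threshold else 0


if __name__ == "__main__":
    # Made-up sequence and gene lengths, for illustration only.
    gene_lengths = [300, 900, 1200, 1500, 2400]
    for start, end in find_n_stretches("ACGTNNNNACGTNAC"):
        print(start, end, hypothetical_genes_to_insert(end - start, gene_lengths))
```

Under this rule a single masked nucleotide never triggers the insertion of hypothetical genes, which is the behaviour the change is meant to achieve.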