NGSchool 2025 - summary

M1

Briefly describe your up-to-date scientific interests and achievements

e.g. study programs completed, scientific fields you have worked in, work-related achievements. Every achievement is valid.

Characters: 0/1000

M2

Briefly describe your skills relevant to the topic of the summer school and the computational biology field. Which of those skills would you like to improve during NGSchool2025? What other skills would you like to acquire, and how can NGSchool help you with that?

e.g. programming proficiency, used bioinformatic tools, types of analyses

Characters: 0/1000

M3

Describe one project you participated in, along with its practical implications for the research field. What was your role? Which challenges (big and small) did you have to overcome and how did you tackle them?

If you are describing your first project and it seems easy and not very impressive, do not worry. As long as you have worked on the problem and learned something, it is important and relevant for this section.

Characters: 0/1000

M4

Besides learning domain-specific skills, how will you benefit from attending the school? How will you use and share what you have learned at NGSchool2025?

We want to foster collaboration and maximize the impact and reach of NGSchool. Please tell us how you plan to use the knowledge you gain and how you will share it with more people.

Characters: 0/1000

B1

You performed bulk transcriptomic sequencing of mouse blood samples. However, when you tried to align the reads to a reference genome, you found: 0% uniquely mapped reads, 10% multi-mapped reads, and 90% unmapped reads. Which of the following options cannot explain the failure in mapping?

Samples have been contaminated

Adapter trimming was not performed

You did not deplete the ribosomal RNA

You are aligning with the wrong reference genome

The sequencing quality was poor

B2

In this 2013 paper, Christenson et al. investigated the role of miR-638 microRNA in Chronic Obstructive Pulmonary Disease (COPD). The figure below is an ECDF plot that shows the effect of miR-638 inhibition on COPD fibroblast gene expression. Based on the plot, which of the following statement(s) is/are correct?

On average, miR-638 targets are upregulated in the transfected COPD fibroblasts compared to the set of all genes

On average, miR-638 targets are downregulated in the transfected COPD macrophages compared to the set of all genes

On average, miR-638 targets are downregulated in the transfected COPD fibroblasts compared to the set of all genes

miR-638 activity downregulates target genes in COPD fibroblasts

miR-638 activity upregulates target genes in COPD fibroblasts

B3

You are analyzing genomic data from an organism with an unknown reference. Which of the following genome assembly strategies would be the most appropriate in this case?

De novo assembly using high-quality short reads only

De novo assembly using long or ultra-long reads only

Hybrid de novo assembly using short and long reads

Alignment to a closely related reference genome

A metagenomics-like assembly approach (utilizing a database of genomes)

B4

Ziff et al. (2023) describe the transcriptional landscape of Amyotrophic Lateral Sclerosis (ALS), a disease that causes motor neuron loss and often involves TDP-43 protein abnormalities. The analysis involved transcriptomic profiling of in vitro motor neuron samples generated using induced pluripotent stem cells derived from ALS patients and non-ALS controls (CTRL). Based on the volcano plot below, which of the following statements is/are false?

TCEAL6 expression is higher in CTRL

CDKN1A is up-regulated in ALS compared to CTRL

The p-value for CHRND differential gene expression is lower than that of MYOG

The p-value cutoff for differentially expressed genes in this plot is 4

The p-value cutoff for differentially expressed genes in this plot is 1.3

B5

To study the role of Dlx5 in mouse pup development, you used CRISPR-Cas9 technology to knock out that gene. The CRISPR-Cas9 complex located the targeted site in the genome, and Cas9 introduced a break there. What are the primary DNA repair mechanisms the cell uses after CRISPR-Cas9's activity, taking into account that Dlx5 is a transcription factor that requires the 5'-TAATTA-3' consensus sequence for DNA binding?

base excision repair

mismatch repair

non-homologous recombination

microhomology-mediated end joining

S1

Last year, 300 people applied for NGSchool. Their mean age was 27.2 years (with a standard deviation of 4.5), and their registration scores were normally distributed with a mean of 80 and a variance of 20. The easiest 6 questions were answered correctly by 100% of participants, and the hardest one was answered correctly by only one participant. Assuming that scoring is the same this year, what is the probability of receiving a score higher than 90?

*This is not real participant data; creative license was applied in designing the above question.

0.500

0.308

0.022

0.013
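The upper-tail probability asked for above can be checked numerically. The following is a minimal sketch, assuming only the score distribution stated in the question (mean 80, variance 20) matters for the calculation; the function name `normal_upper_tail` is illustrative and not part of the form.

```python
import math


def normal_upper_tail(x, mean, variance):
    """Return P(X > x) for a normally distributed X with the given mean and variance."""
    # Standardize using the standard deviation, i.e. the square root of the variance.
    z = (x - mean) / math.sqrt(variance)
    # For a standard normal Z, P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))


# Values taken from the question: mean 80, variance 20, threshold 90.
p_above_90 = normal_upper_tail(90, mean=80, variance=20)
print(round(p_above_90, 3))
```

Note that the question gives a variance rather than a standard deviation, so the square root has to be taken before standardizing.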

S2

While studying the effects of a new drug on immune cell function, you measured the proliferation rate of T-cells in the presence and absence of the drug. Which of the following statistical tests would be most appropriate for analyzing your data?

T-test

ANOVA

Chi-square test

Correlation analysis

S3

A researcher developed a logistic regression model that predicted the rate of lung fibrosis after chest radiotherapy based on patient sex, the percentage of lung tissue exposed to >20 Gy and single nucleotide polymorphisms in genes encoding for metalloproteinases. She validated the model in an independent population and found that sensitivity decreased from 98% to 94% and specificity decreased from 73% to 68% relative to performance on the initial study population. How can she assess model calibration in this scenario?

Using ROC curves

Based on the Akaike information criterion

By plotting the relation between the estimated risk and the observed proportion of events

By dividing the data into training, testing, and validation sets, but only using the first two during the model-building phase

S4

The prevalence of a metabolic disease in the population is 1 in 40,000. You designed a diagnostic method based on assessing the concentration of amino acid X in the spinal fluid using a cutoff value of 22 μmol/L. After running some tests, you determine that with your new method 95% of tested healthy individuals receive a negative diagnosis, but 2% of people who have the disease are incorrectly diagnosed. Assuming your method were to be used as a mass screening tool across a hypothetical population of 2 million individuals (with general health metrics in line with national averages for metabolic markers and a distribution of ages reflecting a typical urban population), calculate the approximate number of expected false positives.

2000

100,000

40,000

800
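The false-positive count can be worked through step by step. This is a small sketch under the assumption that "95% of tested healthy individuals receive a negative diagnosis" is read as a specificity of 0.95; the variable names are illustrative.

```python
# Figures taken from the question.
population = 2_000_000
prevalence_one_in = 40_000   # 1 affected person per 40,000
specificity = 0.95           # share of healthy people who test negative

# Expected number of people with and without the disease.
diseased = population / prevalence_one_in
healthy = population - diseased

# False positives are healthy people who nevertheless test positive.
false_positives = healthy * (1 - specificity)
print(round(false_positives))
```

The 2% miss rate among diseased individuals affects false negatives only, so it does not enter this calculation.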

S5

Which of the following statement(s) is/are correct?

The Markov chain on the left has a steady state distribution

The Markov chain on the left has at least one absorbing state

The Markov chain on the left has at least one dispersive state

The Markov chain on the right has a steady state distribution

The Markov chain on the right has at least one absorbing state

The Markov chain on the right has at least one dispersive state

S6

For the Markov chain on the right, what is the probability that if we start in state 1, we return there in exactly three steps?

Please use the decimal format (e.g. 0.12, 0.99)
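The chain from the figure is not reproduced in this text, so the sketch below uses a made-up two-state transition matrix (`HYPOTHETICAL_P`) purely to illustrate the calculation. It also assumes one reading of the question: the probability of being back in the starting state after exactly three steps, which is the corresponding diagonal entry of the third power of the transition matrix. If the question instead means a first return at step three, paths that revisit the state earlier would have to be excluded.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step_return_probability(p, state, steps):
    """Return the (state, state) entry of p raised to the given power."""
    n = len(p)
    # Start from the identity matrix and multiply by p once per step.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        power = mat_mul(power, p)
    return power[state][state]


# Hypothetical matrix, NOT the chain from the figure. Row i holds the
# probabilities of moving from state i to each state; rows sum to 1.
HYPOTHETICAL_P = [
    [0.5, 0.5],
    [1.0, 0.0],
]

# State 1 in the question corresponds to index 0 here.
p_return = n_step_return_probability(HYPOTHETICAL_P, state=0, steps=3)
print(p_return)
```

To apply this to the actual question, replace `HYPOTHETICAL_P` with the transition probabilities read off the figure.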

S7a

Which of the illustrated hazard rates (alpha, beta, and gamma) corresponds to each of the survival functions a, b and c?

a: alpha, b: beta, c: gamma

a: alpha, b: gamma, c: beta

a: beta, b: alpha, c: gamma

a: beta, b: gamma, c: alpha

a: gamma, b: alpha, c: beta

a: gamma, b: beta, c: alpha

S7b

Looking at the hazard rates, which profile suits the moniker “memoryless” the best?

Alpha

Beta

Gamma

C1

Given an input text file, parse the records to find the number formed when the first digit and the last digit in each line are combined (in that order). What is the sum of all the numbers? Note that the same digit could be both the first and the last digit in a line.

For example, if the lines are:

eightg1

4ninejfpd1jmmnnzjdtk5sjfttvgtdqspvmnhfbm

78seven8

6pcrrqgbzcspbd

The numbers for each line are 11, 45, 78 and 66 respectively. The sum of all the numbers would be 200.
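The worked example above can be reproduced with a short script. This is a minimal sketch, assuming that only the characters 0 to 9 count as digits (spelled-out words such as "eight" are ignored, as in the example) and that a line without any digit contributes 0; the function names and the file name mentioned in the comment are hypothetical.

```python
def line_value(line):
    """Combine the first and last digit of a line into a two-digit number."""
    digits = [ch for ch in line if ch in "0123456789"]
    if not digits:
        # Assumption: a line without digits contributes nothing to the sum.
        return 0
    # If there is only one digit, it serves as both the first and the last.
    return int(digits[0] + digits[-1])


def total(lines):
    """Sum the per-line values over an iterable of lines."""
    return sum(line_value(line) for line in lines)


example_lines = [
    "eightg1",
    "4ninejfpd1jmmnnzjdtk5sjfttvgtdqspvmnhfbm",
    "78seven8",
    "6pcrrqgbzcspbd",
]
example_total = total(example_lines)
print(example_total)

# For the real input, the same function works on an open file, e.g.
# with open("input.txt") as handle: print(total(handle))
# where "input.txt" is a placeholder for the downloaded file.
```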

Characters: 0/64

C2

Reading any number from left to right, if all the digits are in increasing order (for example 134468) we can call such a number an “increasing number”. Similarly, if all the digits are in decreasing order (for example 66420) we can call it a “decreasing number”. For this question, we shall call a positive integer that is neither increasing nor decreasing (for example 155349) a “bouncy number”. Clearly, there cannot be any bouncy numbers below one hundred, but just over half of the numbers below one thousand (525 exactly) are bouncy. In fact, the lowest number for which the proportion of bouncy numbers first reaches 50% is 538. Surprisingly, bouncy numbers become more and more common and by the time we reach 21780 the proportion of bouncy numbers is equal to 90%. Find the lowest number for which the proportion of bouncy numbers is exactly 99%.
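One straightforward way to approach this is a brute-force scan that keeps a running count of bouncy numbers. The sketch below is one possible approach rather than a reference solution; the function names are illustrative, and the checks use only the figures quoted in the question (525 below one thousand, 50% at 538, 90% at 21780).

```python
def is_bouncy(n):
    """Return True if the digits of n are neither non-decreasing nor non-increasing."""
    digits = str(n)
    pairs = list(zip(digits, digits[1:]))
    increasing = all(a <= b for a, b in pairs)
    decreasing = all(a >= b for a, b in pairs)
    return not increasing and not decreasing


def bouncy_count_up_to(limit):
    """Count the bouncy numbers among 1..limit inclusive."""
    return sum(1 for n in range(1, limit + 1) if is_bouncy(n))


def first_exact_proportion(numerator, denominator):
    """Return the lowest n at which bouncy/n equals numerator/denominator exactly."""
    bouncy = 0
    n = 0
    while True:
        n += 1
        if is_bouncy(n):
            bouncy += 1
        # Compare with integer arithmetic to avoid floating-point error.
        if bouncy * denominator == numerator * n:
            return n


if __name__ == "__main__":
    # The 99% case scans well over a million integers, so it takes a few seconds.
    print(first_exact_proportion(99, 100))
```

Comparing `bouncy * denominator` with `numerator * n` keeps the test for "exactly 99%" free of rounding issues.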

Characters: 0/64

C3a

How many different peptide sequences can be created?

C3b

What is the length of the shortest peptide sequence?

C3c

What is the 2nd amino acid of the longest peptide? Provide a one-letter answer.

C4

The efficient parsing of biological databases is an essential skill for computational biologists. Use any approach to retrieve information about the human gene located on chromosome 17, with genomic coordinates (in the GRCh38.p13 reference genome) of chr17:60,149,942-60,179,021. This gene is translated to a protein involved in many pathological and physiological responses in human diseases. However, these processes are mostly studied using mouse models. Which of the statement(s) below is/are correct?

The human gene contains 10 exons

The human-mouse orthology type is “One to One”

The symbol for the mouse gene is also Ca4

The mouse gene and the human gene contain 8 exons each

The name for the corresponding human protein is CAH4

The protein transcribed in humans is working as a transcription factor

BU1

Travel grant justification

Characters: 0/500

BU2

Fee waiver justification

Characters: 0/500

Summary

Introduction

Personal

Motivation

Briefly describe your up-to-date scientific interests and achievements

Briefly describe your skills relevant to the topic of the summer school and the computational biology field. Which of those skills would you like to improve during NGSchool2025? What other skills would you like to acquire, and how can NGSchool help you with that?

Describe one project you participated in, along with its practical implications for the research field. What was your role? Which challenges (big and small) did you have to overcome and how did you tackle them?

Besides learning domain-specific skills, how will you benefit from attending the school? How will you use and share what you have learned at NGSchool2025?

Bioinformatics

You performed bulk transcriptomic sequencing of mouse blood samples. However, when you tried to align the reads to a reference genome, you found: 0% uniquely mapped reads, 10% multi-mapped reads, and 90% unmapped reads. Which of the following options cannot explain the failure in mapping?

You are analyzing genomic data from an organism with an unknown reference. Which of the following genome assembly strategies would be the most appropriate in this case?

Statistics

While studying the effects of a new drug on immune cell function, you measured the proliferation rate of T-cells in the presence and absence of the drug. Which of the following statistical tests would be most appropriate for analyzing your data?

Which of the following statement(s) is/are correct?

For the Markov chain on the right, what is the probability that if we start in state 1, we return there in exactly three steps?

Which of the illustrated hazard rates (alpha, beta, and gamma) corresponds to each of the survival functions a, b and c?

Looking at the hazard rates, which profile suits the moniker “memoryless” the best?

Coding

Given an input text file, parse the records to find the number formed when the first digit and the last digit in each line are combined (in that order). What is the sum of all the numbers? Note that the same digit could be both the first and the last digit in a line.

How many different peptide sequences can be created?

What is the length of the shortest peptide sequence?

What is the 2nd amino acid of the longest peptide? Provide a one-letter answer.

Bursary

Travel grant justification

Fee waiver justification

I have read the Personal Data Processing and Storage and I agree to have my data processed and stored by the NGSchool Society.	NO
I have read the Code of Conduct and declare I will follow it.	NO