Background:
Given is the crude gene expression profile for 4 cell types involved in human plasma cell differentiation. The process comprises 4 stages of development: memory B cells (MBC), pre-plasmablasts (prePB), plasmablasts (PB) and plasma cells (PC).
Alex, a scientist, wants to understand the biological processes that distinguish the 4 stages of plasma cell differentiation. Alex designs an experiment to capture the bulk (total) RNA profile of 3 replicates of each cell type across the 4 stages through RNA-sequencing. Please help Alex understand the data generated from the experiment and the analysis steps needed to identify the biological processes that differentiate the 4 cell types.
Part I: Data exploration and normalization
Q1: Alex first wants to get a feel for the data generated and to look at its size and composition:
i) How many total genes have been captured in the experiment for the 4 different cell types? [HINT: Count the unique GENE_IDs]
ii) Alex knows that some genes share an identical gene name even though they have different gene IDs. Identify those and remove them from the data:
a) How many duplicate genes are present? [HINT: Count the duplicated GENE_NAMEs]
b) List the top 5 gene types that have the maximum count of duplicated gene names
c) De-duplicate the dataset, retaining for each duplicated gene name the entry with the highest Mean Average Deviation (MAD)
d) Index the resultant dataset by the unique gene names and save your output in a CSV file (retain only the columns GENE_NAME, GENE_TYPE and the data for the 4 cell types). Use this output for all further analyses.
iii) List the counts of different gene types present in the dataset and plot them to give a visual representation of the same.
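The de-duplication and counting steps in Q1 can be sketched with pandas as below. This is only a minimal illustration on a toy table, not the expected solution: the sample column names (MBC_1, PC_1, ...), the values, the GENE_TYPE labels and the output file name are all hypothetical, and MAD is assumed here to mean the mean absolute deviation of each gene's values around its own mean.

```python
import pandas as pd

# Toy stand-in for the real expression table. GENE_ID, GENE_NAME and
# GENE_TYPE come from the assignment; everything else is hypothetical.
df = pd.DataFrame({
    "GENE_ID":   ["ID1", "ID2", "ID3", "ID4"],
    "GENE_NAME": ["A", "A", "B", "C"],
    "GENE_TYPE": ["protein_coding", "protein_coding", "lncRNA", "protein_coding"],
    "MBC_1": [1.0, 5.0, 2.0, 3.0],
    "MBC_2": [1.0, 9.0, 2.5, 3.0],
    "PC_1":  [1.0, 1.0, 8.0, 4.0],
    "PC_2":  [1.0, 0.0, 7.5, 4.0],
})
sample_cols = ["MBC_1", "MBC_2", "PC_1", "PC_2"]

# i) Number of unique gene IDs.
n_genes = df["GENE_ID"].nunique()

# ii a) Rows whose GENE_NAME already appeared earlier in the table.
n_duplicates = int(df["GENE_NAME"].duplicated().sum())

# ii c) Assumption: MAD = mean absolute deviation around the row mean.
values = df[sample_cols]
mad = values.sub(values.mean(axis=1), axis=0).abs().mean(axis=1)

# Sort by MAD (highest first), keep the first row per gene name,
# then index by gene name and keep only the requested columns.
dedup = (
    df.assign(MAD=mad)
      .sort_values("MAD", ascending=False)
      .drop_duplicates(subset="GENE_NAME", keep="first")
      .set_index("GENE_NAME")
      .sort_index()[["GENE_TYPE"] + sample_cols]
)
# dedup.to_csv("dedup_expression.csv")  # hypothetical output file name

# iii) Count of each gene type in the de-duplicated table.
type_counts = dedup["GENE_TYPE"].value_counts()
print(n_genes, n_duplicates)
print(type_counts)
```

If MAD is instead meant as the median absolute deviation, only the line computing `mad` would need to change.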
Q2: Alex now wants to look at general statistical metrics and the spread of the dataset:
i) Look at the general distribution of the dataset per sample/replicate of the 4 cell types
ii) Identify the set of coding genes and comment on the distribution of the subset of coding genes compared to that of the whole dataset [HINT: Use a visual representation such as a line plot or boxplot to comment]
iii) Save the expression profile of the coding genes in a separate CSV file. Use this output for all further analyses.
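One possible way to approach Q2 is to compare per-sample summary statistics for the whole table against those for the coding subset. The sketch below assumes that coding genes are labelled "protein_coding" in GENE_TYPE; that label, the sample column names, the values and the file name are hypothetical and should be checked against the actual data.

```python
import pandas as pd

# Toy table indexed by gene name; all values are hypothetical.
expr = pd.DataFrame(
    {
        "GENE_TYPE": ["protein_coding", "lncRNA", "protein_coding"],
        "MBC_1": [5.0, 0.5, 12.0],
        "PC_1":  [1.0, 0.2, 30.0],
    },
    index=pd.Index(["A", "B", "C"], name="GENE_NAME"),
)
sample_cols = ["MBC_1", "PC_1"]

# Assumption: coding genes carry the label "protein_coding".
coding = expr[expr["GENE_TYPE"] == "protein_coding"]

# Per-sample count, mean, std, min, quartiles and max.
summary_all = expr[sample_cols].describe()
summary_coding = coding[sample_cols].describe()
print(summary_all)
print(summary_coding)

# coding.to_csv("coding_expression.csv")  # hypothetical output file name
```

The same two tables can be passed to a boxplot routine to compare the distributions visually.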
Q4: Are the scales of the expression profiles for the different cell types in a similar range? If not, normalize the data for downstream analysis. You could use either standard scaling (z-score transformation) or min-max scaling. Also, explain why scaling is useful. [HINT: Redo the plot(s) created in Q2 after normalization and compare your results]
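The two scaling options named above can be illustrated with NumPy as follows. This is a sketch on a hypothetical matrix, and it assumes scaling is applied per sample (per column); scaling per gene would use the other axis.

```python
import numpy as np

# Hypothetical expression matrix: rows = genes, columns = samples.
# The second sample is on a 10x larger scale than the first.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

# Standard scaling (z-score): subtract the column mean and divide by
# the column standard deviation, so each sample has mean 0 and std 1.
z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: map each column linearly onto the range [0, 1].
mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(z)
print(mm)
```

After either transformation the two columns in this toy example become identical, which shows how scaling removes differences in range so that samples can be compared on a common footing.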
Part II: Statistical testing
Q1: Differential expression and hypothesis testing
After data normalization, Alex would like to first identify genes/features that differentiate between the first and last cell type stages, i.e., between ‘MBC’ and ‘PC’ cell types. Given the large number of features in this dataset, Alex needs a statistical method to identify useful patterns across cell types. Alex knows that hypothesis testing tools like the t-test can be used to identify the important features (genes) for binary classification (into 'MBC' or 'PC' in this case).
Perform a t-test for each feature to check whether the means of the two cell types (MBC and PC) differ significantly from each other. [Use significance level (alpha) = 0.05]
i) Compile the t-test results from all the features in the form of a table with 3 columns - feature name, t-statistic and p-value.
ii) Reorder the rows of the table by increasing p-values. How many features do you identify as being significant by the individual t-tests done on each of them?
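The per-feature testing described above might look like the sketch below. It uses `scipy.stats.ttest_ind`, which by default performs a two-sided Student's t-test assuming equal variances; whether that or Welch's variant (`equal_var=False`) is intended is not stated in the assignment. The gene names, sample names and values are hypothetical.

```python
import pandas as pd
from scipy import stats

genes = ["gene_0", "gene_1", "gene_2"]

# Hypothetical normalized expression: rows = genes, 3 replicates each.
mbc = pd.DataFrame(
    [[1.0, 1.1, 0.9], [2.0, 3.0, 4.0], [1.0, 2.0, 3.0]],
    index=genes, columns=["MBC_1", "MBC_2", "MBC_3"],
)
pc = pd.DataFrame(
    [[5.0, 5.1, 4.9], [2.1, 3.1, 3.9], [1.0, 2.0, 3.0]],
    index=genes, columns=["PC_1", "PC_2", "PC_3"],
)

# One t-test per row (gene), comparing MBC replicates with PC replicates.
t_stat, p_val = stats.ttest_ind(mbc.to_numpy(), pc.to_numpy(), axis=1)

# i) Table with feature name, t-statistic and p-value.
# ii) Rows reordered by increasing p-value.
results = (
    pd.DataFrame({"feature": genes, "t_statistic": t_stat, "p_value": p_val})
      .sort_values("p_value")
      .reset_index(drop=True)
)

alpha = 0.05
n_significant = int((results["p_value"] < alpha).sum())
print(results)
print(n_significant)
```

With only 3 replicates per group each test has 4 degrees of freedom, so only large, consistent differences reach significance.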
Q2: Extract the top 100 significant genes identified above and create a heatmap including all 4 cell types [HINT: scale your heatmap, if required]. Also, perform k-means clustering on the rows and rearrange them so that features from the same group lie together. What do you infer from the heatmap and the clustering results?
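The row clustering and reordering step can be sketched with scikit-learn as below. The matrix, the gene names and the choice of k = 2 clusters are hypothetical; the assignment does not specify the number of clusters. Rows are z-scored first so that clustering reflects the expression pattern rather than the absolute level.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical matrix: 6 genes (rows) x 4 samples (columns).
X = np.array([
    [1.0, 1.2, 9.0, 9.1],
    [8.0, 8.2, 1.0, 1.1],
    [1.1, 0.9, 8.8, 9.3],
    [7.9, 8.1, 0.8, 1.2],
    [0.9, 1.0, 9.2, 9.0],
    [8.1, 7.8, 1.1, 0.9],
])
genes = np.array(["g1", "g2", "g3", "g4", "g5", "g6"])

# Row-wise z-score: each gene gets mean 0 and std 1 across samples.
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Assumption: k = 2 clusters; a fixed random_state keeps runs repeatable.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Reorder rows so that genes with the same cluster label are adjacent.
order = np.argsort(labels, kind="stable")
Z_sorted = Z[order]
genes_sorted = genes[order]
labels_sorted = labels[order]
print(list(zip(genes_sorted, labels_sorted)))
```

`Z_sorted` can then be drawn as a heatmap with any plotting library, using `genes_sorted` as the row labels.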
Q3: With the above method of hypothesis testing, what is the probability that at least one feature in this dataset will be identified as significant purely by chance? Perform multiple hypothesis correction for the dataset and compare the number of significant genes identified after correction with the number obtained before correction. [HINT: Use Bonferroni or BH correction]. Interpret your results.
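For m independent tests at level alpha, the probability of at least one false positive is 1 - (1 - alpha)^m. The sketch below computes that quantity and applies both corrections named in the hint to a hypothetical list of p-values. Independence between tests is an assumption that gene expression data generally does not satisfy exactly, and the function names are illustrative only.

```python
import numpy as np

def family_wise_error_rate(alpha, m):
    """P(at least one false positive) for m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni(p, alpha=0.05):
    """Reject where p is below alpha divided by the number of tests."""
    p = np.asarray(p, dtype=float)
    return p < alpha / p.size

def benjamini_hochberg(p, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= alpha * k / m (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values from 10 tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]

n_raw = int(np.sum(np.asarray(p_values) < 0.05))
n_bonf = int(bonferroni(p_values).sum())
n_bh = int(benjamini_hochberg(p_values).sum())
print(family_wise_error_rate(0.05, len(p_values)), n_raw, n_bonf, n_bh)
```

In this toy example 5 tests pass the uncorrected threshold, 2 survive BH and 1 survives Bonferroni, which reflects the usual ordering: Bonferroni controls the family-wise error rate and is the stricter of the two, while BH controls the false discovery rate.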
Q4: Perform dimensionality reduction on this dataset with PCA.
i) How many principal components are needed to explain 95% variance in the data?
ii) Generate a PCA scores plot with PC1 on the x-axis and PC2 on the y-axis and label the observations with ‘cell_type’.
iii) What do you infer from the PCA plot? Explain your findings.
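The PCA steps in Q4 can be sketched with a singular value decomposition in NumPy, as below. The matrix is hypothetical, with samples as rows (observations) and genes as columns (features); here "PC" refers to principal components, not plasma cells.

```python
import numpy as np

# Hypothetical matrix: 4 samples (rows) x 3 genes (columns).
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.1],
])

# Center each gene (column) before decomposing.
Xc = X - X.mean(axis=0)

# SVD of the centered data: rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# i) Fraction of variance explained per component, and the smallest
#    number of components whose cumulative share reaches 95%.
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# ii) Scores: coordinates of each sample on the principal components.
#     Columns 0 and 1 are the PC1 and PC2 values for a scores plot.
scores = U * S
pc1, pc2 = scores[:, 0], scores[:, 1]
print(n_components_95)
print(explained)
```

Because centered data from n samples has rank at most n - 1, the 12 samples in this experiment can yield at most 11 components with non-zero variance, however many genes are measured.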