Reading 6: organizing, visualising, and describing data
This is a mind map of Reading 6: organizing, visualising, and describing data.
Edited on 2022-06-07 11:45:04
organizing, visualising, and describing data
data types and data organization
numerical vs. categorical
numerical data
continuous data
can take any value within a range (infinitely many possible values)
discrete data
takes only countable, distinct values
categorical data
nominal data
ordinal data
labels that can be ranked in a logical order
structured vs. unstructured
unstructured data must be transformed into structured data before it can be used in financial modeling
structured data - cross sectional data, time series data, market data, analytical data, fundamental data, etc.
unstructured data - social media, corporate filings, traffic sensors, etc.
cross sectional vs. time series
cross sectional data
observations of comparable units, all taken at a single point in time
time series data
a sequence of observations of one variable taken over successive periods of time
summarizing and visualizing data
population vs. sample
population: the set of all possible members of the group of interest
sample: a subset of the population
sample statistic: a value computed from the sample to estimate a population parameter
descriptive statistics
frequency distribution
used to summarize a single variable
types of frequency distribution
absolute frequency
relative frequency
cumulative relative frequency distribution
diagrams:
histogram
frequency polygon - with line connection
bar chart
the steps of constructing a frequency distribution
find the minimum and maximum of the data
the intervals must cover every observation from the minimum to the maximum
don't use too many intervals (the data is too dispersed to show a trend)
don't use too few intervals (the data is summarized too broadly)
5-10 intervals usually work well
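The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `frequency_distribution` helper, the equal-width binning rule, and the `returns` data are all assumptions made for the example.

```python
def frequency_distribution(data, k):
    """Bin data into k equal-width intervals and tabulate frequencies."""
    lo, hi = min(data), max(data)          # step 1: minimum and maximum
    width = (hi - lo) / k                  # intervals cover the whole range
    counts = [0] * k
    for x in data:
        # The maximum is placed in the last interval (closed on the right).
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    n = len(data)
    rows, cumulative = [], 0.0
    for i, absolute in enumerate(counts):
        relative = absolute / n            # relative frequency
        cumulative += relative             # cumulative relative frequency
        rows.append((lo + i * width, lo + (i + 1) * width,
                     absolute, relative, cumulative))
    return rows

# Hypothetical return observations (in percent).
returns = [-4.2, -1.5, 0.3, 0.8, 1.1, 1.9, 2.4, 3.0, 3.7, 5.8]
table = frequency_distribution(returns, k=5)
for low, high, absolute, relative, cumulative in table:
    print(f"[{low:5.1f}, {high:5.1f})  {absolute:2d}  {relative:.2f}  {cumulative:.2f}")
```

The absolute frequencies are what a histogram plots; joining the interval midpoints with lines gives the frequency polygon.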
contingency table (two-way table)
marginal frequency vs. joint frequency
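marginal frequency: the row or column total, shown in the last row or last column
joint frequency: the count in an interior cell, where one value of each variable occurs together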
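As a rough illustration of the two kinds of frequency in a contingency table, the sketch below counts a handful of made-up `(sector, size)` observations. The data and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical observations: one (sector, size) pair per stock.
stocks = [
    ("tech", "large"), ("tech", "small"), ("tech", "large"),
    ("energy", "small"), ("energy", "small"), ("energy", "large"),
    ("tech", "small"),
]

joint = Counter(stocks)                        # joint frequencies (interior cells)
row_totals = Counter(s for s, _ in stocks)     # marginal frequencies by sector
col_totals = Counter(z for _, z in stocks)     # marginal frequencies by size

for sector in sorted(row_totals):
    cells = [joint[(sector, size)] for size in sorted(col_totals)]
    print(sector, cells, row_totals[sector])
print("total", [col_totals[size] for size in sorted(col_totals)], len(stocks))
```

Each marginal frequency is the sum of the joint frequencies in its row or column, and the marginal frequencies in either direction add up to the sample size.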
diagrams:
tree map
heat map
word cloud
visualizing the frequency of unstructured texts
analyzing and visualizing the relationships between two variables
scatter plot
identify the relationship between two variables
matrix scatter plot
identify the pairwise relationships among more than two variables
inferential statistics
infer from sample to population
measures of central tendency
central tendency indicates RETURN
mode
the value that occurs with the highest frequency in the data
median
the middle value of the data once sorted, so it is determined by location
Quartile / quintile / decile / percentile
also determined by location in the sorted data (steps of 25%, 20%, 10%, 1%)
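The location-based measures can be checked with Python's standard `statistics` module. This is only a sketch on hypothetical data; `statistics.quantiles` with its default `exclusive` method places the y-th percentile at position (n + 1) * y / 100 in the sorted data and interpolates linearly, which is one common convention among several.

```python
import statistics

# Hypothetical observations (9 values, already sorted for readability).
data = [2, 3, 3, 5, 7, 8, 9, 11, 12]

mode = statistics.mode(data)               # most frequent value
median = statistics.median(data)           # middle value of the sorted data
quartiles = statistics.quantiles(data, n=4)    # three cut points: Q1, Q2, Q3
quintiles = statistics.quantiles(data, n=5)    # four cut points
deciles = statistics.quantiles(data, n=10)     # nine cut points

print(mode, median, quartiles)
```

The second quartile is the median, so `quartiles[1]` and `median` agree.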
mean
population vs. sample mean
arithmetic mean
weighted mean
harmonic mean
used to calculate the average cost per share when an equal amount of money is invested in each period
geometric mean
used to calculate the compound growth rate of a variable over multiple periods
trimmed mean
discard a percentage of extreme observations
winsorised mean
replace the extreme observations with substituted values
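A short sketch comparing the different means on made-up numbers. The helper names `trimmed_mean` and `winsorized_mean`, and the convention that `pct` is the share of observations handled in each tail, are assumptions for this example.

```python
import statistics

# Hypothetical purchase prices, with the same cash amount invested each period.
prices = [10.0, 20.0, 40.0]
arithmetic = statistics.mean(prices)
# Investing 1000 at each price buys 100 + 50 + 25 = 175 shares for 3000.
harmonic = statistics.harmonic_mean(prices)    # average cost per share

# Hypothetical periodic returns: compound the growth factors, take the n-th root.
returns = [0.10, -0.05, 0.20]
geometric = statistics.geometric_mean([1 + r for r in returns]) - 1

# Weighted mean of hypothetical asset returns with portfolio weights.
weights, asset_returns = [0.6, 0.4], [0.08, 0.03]
weighted = sum(w * r for w, r in zip(weights, asset_returns))

def trimmed_mean(data, pct):
    """Discard pct of the observations in each tail, then average the rest."""
    xs = sorted(data)
    k = int(len(xs) * pct)
    return statistics.mean(xs[k:len(xs) - k])

def winsorized_mean(data, pct):
    """Replace pct of the observations in each tail with the nearest kept value."""
    xs = sorted(data)
    k = int(len(xs) * pct)
    if k:
        xs[:k] = [xs[k]] * k
        xs[-k:] = [xs[-k - 1]] * k
    return statistics.mean(xs)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]      # one extreme observation
trimmed = trimmed_mean(sample, 0.10)
winsorized = winsorized_mean(sample, 0.10)
print(statistics.mean(sample), trimmed, winsorized)
```

The plain arithmetic mean of `sample` is pulled up by the single extreme value, while the trimmed and winsorized means stay close to the bulk of the data.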
measures of dispersion
dispersion indicates RISKS
range
mean absolute deviation
$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
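A minimal sketch of the range and the mean absolute deviation on hypothetical data:

```python
# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)            # range = maximum - minimum
mean = sum(data) / len(data)
# Average of the absolute deviations from the mean.
mad = sum(abs(x - mean) for x in data) / len(data)

print(value_range, mad)
```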
variance
variance is the square of the standard deviation
standard deviation
the standard deviation is the positive square root of the variance, so it is in the same units as the data
steps for calculating the sample standard deviation
calculate the sample mean
calculate each observation's deviation from the sample mean
square every deviation
sum all of the squared deviations
divide by the number of observations minus 1 (n - 1)
take the square root
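The steps above can be sketched as follows. The function name `sample_std` and the data are hypothetical; the result is compared against `statistics.stdev`, which also uses the n - 1 divisor.

```python
import math
import statistics

def sample_std(data):
    n = len(data)
    mean = sum(data) / n                       # 1. sample mean
    deviations = [x - mean for x in data]      # 2. deviations from the mean
    squared = [d ** 2 for d in deviations]     # 3. square each deviation
    total = sum(squared)                       # 4. sum the squared deviations
    variance = total / (n - 1)                 # 5. divide by n - 1
    return math.sqrt(variance)                 # 6. take the square root

data = [2, 4, 4, 4, 5, 5, 7, 9]                # hypothetical observations
s = sample_std(data)
print(s, statistics.stdev(data))
```

Dividing by n instead of n - 1 in step 5 would give the population standard deviation (`statistics.pstdev`).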
population variance (divide by N) vs. sample variance (divide by n - 1)
coefficient of variation
Standard Deviation / Mean
it measures the risk per unit return
Sharpe ratio
(portfolio return - risk-free rate) / standard deviation of portfolio returns
it measures the excess return per unit of risk
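Both ratios can be computed directly from a return series. The returns and the risk-free rate below are made-up numbers used only for illustration.

```python
import statistics

# Hypothetical annual portfolio returns and risk-free rate.
returns = [0.08, 0.12, -0.02, 0.10, 0.07]
risk_free = 0.02

mean_return = statistics.mean(returns)
std_dev = statistics.stdev(returns)            # sample standard deviation

# Coefficient of variation: risk per unit of return (lower is better).
cv = std_dev / mean_return
# Sharpe ratio: excess return per unit of risk (higher is better).
sharpe = (mean_return - risk_free) / std_dev

print(f"CV = {cv:.3f}, Sharpe = {sharpe:.3f}")
```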
Target Downside Deviation
it measures dispersion around a specified target (B), counting only the observations that fall below the target
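A hedged sketch of the calculation, assuming the common textbook form in which squared shortfalls below the target B are summed and divided by n - 1, where n is the total number of observations. The function name and data are hypothetical.

```python
import math

def target_downside_deviation(returns, target):
    """Square root of the summed squared shortfalls below target over n - 1."""
    n = len(returns)
    # Only observations at or below the target contribute.
    shortfalls = [(r - target) ** 2 for r in returns if r <= target]
    return math.sqrt(sum(shortfalls) / (n - 1))

returns = [0.05, -0.02, 0.01, 0.08, -0.04]     # hypothetical returns
tdd = target_downside_deviation(returns, target=0.02)
print(f"{tdd:.4f}")
```

Returns above the target (0.05 and 0.08 here) are ignored, so the measure only captures downside risk.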
Skewness and Kurtosis in Returns Distributions
normal distribution
1 mean = median = mode
2 the distribution can be completely described by its mean and variance
3 about 68% of observations lie within ±1 standard deviation of the mean, about 95% within ±2, and about 99.7% within ±3
4 the distribution is symmetric about its mean
skewed distribution
positively skewed distribution
long tail on the right side
extreme outliers on the right side, mean > median > mode
negatively skewed distribution
long tail on the left side
extreme outliers on the left side, mean < median < mode
sample skewness
positive skewness, Sk > 0
negative skewness, Sk < 0
skewness is considered significant when |Sk| > 0.5
kurtosis
more peaked than a normal distribution - leptokurtic distribution
it has fatter tails than a normal distribution, so extreme outcomes are more likely (higher risk)
flatter than a normal distribution, with thinner tails - platykurtic distribution
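The sketch below computes sample skewness and excess kurtosis using the large-sample approximations (average cubed or fourth-power deviation divided by the sample standard deviation cubed or to the fourth power). Exact small-sample correction factors differ between textbooks and libraries, so treat the formulas, function names, and data as assumptions.

```python
import math

def _sample_std(data):
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def sample_skewness(data):
    """Approximate skewness: mean cubed deviation over s cubed."""
    n = len(data)
    mean = sum(data) / n
    s = _sample_std(data)
    return sum((x - mean) ** 3 for x in data) / n / s ** 3

def excess_kurtosis(data):
    """Approximate kurtosis minus 3, the kurtosis of a normal distribution."""
    n = len(data)
    mean = sum(data) / n
    s = _sample_std(data)
    return sum((x - mean) ** 4 for x in data) / n / s ** 4 - 3

symmetric = [1, 2, 3, 4, 5]                    # hypothetical symmetric data
right_tail = [1, 1, 2, 2, 3, 10]               # hypothetical data with a right outlier

print(sample_skewness(symmetric), sample_skewness(right_tail))
print(excess_kurtosis(symmetric), excess_kurtosis(right_tail))
```

Positive excess kurtosis indicates a leptokurtic distribution and negative excess kurtosis a platykurtic one.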
covariance and correlation
both measure the linear relationship between two variables
variance vs. covariance
variance: $S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
covariance: $S_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
normalized covariance = correlation coefficient
$r_{xy} = \frac{S_{xy}}{S_x S_y}$
r = +1: perfect positive linear relationship
r = -1: perfect negative (inverse) linear relationship
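A minimal sketch of sample covariance and the correlation coefficient, written out by hand so the n - 1 divisor and the normalization by the two standard deviations are visible. Function names and data are hypothetical.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance normalized by the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # variance is cov(x, x)
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]                        # moves exactly with x
y_down = [10, 8, 6, 4, 2]                      # moves exactly against x

print(sample_covariance(x, y_up))
print(sample_correlation(x, y_up), sample_correlation(x, y_down))
```

Covariance depends on the units of the two variables, while the correlation coefficient is unit-free and always lies between -1 and +1.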
limitations of correlation analysis
sensitive to outliers
extreme values can strongly influence the variance, covariance, and correlation coefficient
spurious correlation
the correlation is statistically significant even though there is no logical relationship between the two variables
correlation does not imply causation
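The sensitivity to outliers can be seen in a small, made-up example: six points with a perfect negative relationship, then the same points plus one extreme observation. The data and function name are hypothetical.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient computed from deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (s_x * s_y)

x = [1, 2, 3, 4, 5, 6]
y = [6, 5, 4, 3, 2, 1]                         # perfect negative relationship
r_clean = correlation(x, y)

# Add a single extreme observation far from the rest of the data.
r_outlier = correlation(x + [50], y + [50])

print(r_clean, r_outlier)
```

One extreme point is enough to flip the sign of the correlation here, which is why a scatter plot should be inspected before relying on the coefficient.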