Data Management and Visualization — Week 3
My name is Dipika Pawar, and this blog is part of the Data Management and Visualization course on Coursera. The assignments are submitted in the form of blog posts, so I chose Medium as the medium for my assignments. Click here for the previous posts: Week1 & Week2.
Here is my codebook:
Research Question:
Is there any relation between life expectancy and alcohol consumption?
Is there any relation between Income and alcohol consumption?
Assignment 3:
Assignment 3 covers data management for the chosen topic. We have to make and implement data-management decisions for our variables. Data management is the process of storing, organizing, and maintaining data; keeping that data secure is also an important part of it.
Following this week's videos, here are the steps:
Setting aside missing data → Coding and validate the data → Creating secondary variables → Grouping values within individual variable
This blog covers:
Removing rows with missing data
Grouping the income, alcohol consumption, and life expectancy values
Showing the frequency distributions before and after grouping
Data Dictionary:
Code:
Step 1: Importing Libraries
Step 2: Read the specific columns of the dataset and rename the columns
Output: What the data variable contains
Step 3: Convert the columns read from the .csv file to numeric types
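The key to this step is `pandas.to_numeric` with `errors='coerce'`, which turns any value it cannot parse into NaN instead of raising an exception. A minimal sketch with a hypothetical series of CSV-style strings:

```python
import pandas as pd

# Hypothetical mini-series: numeric strings mixed with a blank entry,
# the way values arrive when a CSV column is read as text.
s = pd.Series(['7.3', '', '12.5'])

# errors='coerce' converts anything unparseable into NaN instead of failing.
s_num = pd.to_numeric(s, errors='coerce')
```

The coerced NaNs are exactly what the next step's `dropna` removes.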
Step 4: Drop the missing data rows
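Dropping incomplete rows is a one-liner with `DataFrame.dropna`; `how='any'` removes a row if any of its columns is missing. A sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame with one complete row and one incomplete row.
df = pd.DataFrame({'country': ['A', 'B'], 'income': [100.0, None]})

# how='any' drops a row as soon as any column in it is missing.
clean = df.dropna(axis=0, how='any')
```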
Step 5: Display absolute and relative frequency
Output:
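Both frequency tables in this step come from `Series.value_counts`: by default it returns absolute counts, and `normalize=True` turns them into relative frequencies (proportions). A sketch with hypothetical repeated values so the counts are visible:

```python
import pandas as pd

# Hypothetical values with a repeat, so the frequencies differ.
s = pd.Series([5.0, 5.0, 7.0])

counts = s.value_counts(sort=False)                  # absolute frequency
props = s.value_counts(sort=False, normalize=True)   # relative frequency
```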
Step 6: Group each variable's values to better address the research questions
Output: Each variable's values replaced by its group number
Step 7: Frequency Distribution of new grouped data
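Steps 6 and 7 combine `pandas.cut`, which assigns each value to a labelled interval, with `value_counts` on the resulting categorical. A sketch using the same decade bins as the life-expectancy grouping below, on a hypothetical handful of values:

```python
import pandas as pd

# Hypothetical life-expectancy values to bin into decade groups.
life = pd.Series([45.0, 62.3, 74.1, 75.8, 83.2])

# pandas.cut maps each value to its labelled interval; bins are
# right-closed by default, e.g. 45.0 falls into (40, 50].
groups = pd.cut(life, [40, 50, 60, 70, 80, 90],
                labels=['1', '2', '3', '4', '5'])

# value_counts(sort=False) keeps the groups in bin order.
freq = groups.value_counts(sort=False)
```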
=======================================
Whole code:
#Import the libraries
import pandas
import numpy
from collections import OrderedDict
from tabulate import tabulate, tabulate_formats
data = pandas.read_csv('CodebookGap.csv', low_memory=False,
                       skip_blank_lines=True,
                       usecols=['country', 'incomeperperson', 'alcconsumption', 'lifeexpectancy'])
data.columns = ['country', 'income', 'alcohol', 'life']
‘’’
# Variables Descriptions
alcohol = “2008 alcohol consumption per adult (liters, age 15+)”
income = “2010 Gross Domestic Product per capita in constant 2000 US$”
life = “2011 life expectancy at birth (years)”’’’
data.info()
# Coerce the three measure columns to numeric; unparseable entries become NaN
for dt in ('income', 'alcohol', 'life'):
    data[dt] = pandas.to_numeric(data[dt], errors='coerce')
data.info()
# Show the rows with missing values, then drop them
nullLabels = data[data.country.isnull() | data.income.isnull() | data.alcohol.isnull() | data.life.isnull()]
print(nullLabels)
data = data.dropna(axis=0, how='any')
data.info()
c2 = data['income'].value_counts(sort=False)
p2 = data['income'].value_counts(sort=False, normalize=True)
c3 = data['alcohol'].value_counts(sort=False)
p3 = data['alcohol'].value_counts(sort=False, normalize=True)
c4 = data['life'].value_counts(sort=False)
p4 = data['life'].value_counts(sort=False, normalize=True)
print("****************************************************")
print("*************Absolute Frequency********************")
print("Income Per Person:")
print("Income Freq")
print(c2)
print("Alcohol Consumption:")
print("Alcohol Freq")
print(c3)
print("Life Expectancy:")
print("Life Freq")
print(c4)
print("****************************************************")
print("*************Relative Frequency********************")
print("Income Per Person:")
print("Income Freq")
print(p2)
print("Alcohol Consumption:")
print("Alcohol Freq")
print(p3)
print("Life Expectancy:")
print("Life Freq")
print(p4)
# Collect the min and max of each variable to pick sensible bin edges
minMax = OrderedDict()
dict1 = OrderedDict()
dict1['min'] = data.life.min()
dict1['max'] = data.life.max()
minMax['life'] = dict1
dict2 = OrderedDict()
dict2['min'] = data.income.min()
dict2['max'] = data.income.max()
minMax['income'] = dict2
dict3 = OrderedDict()
dict3['min'] = data.alcohol.min()
dict3['max'] = data.alcohol.max()
minMax['alcohol'] = dict3
df = pandas.DataFrame([minMax['income'], minMax['life'], minMax['alcohol']],
                      index=['Income', 'Life', 'Alcohol'])
print(df.sort_index(axis=1, ascending=False))
dummyData = data.copy()
# Maps from group number to the interval it represents
income_map = {1: '>=100 <5K', 2: '>=5K <10K', 3: '>=10K <20K',
              4: '>=20K <30K', 5: '>=30K <40K', 6: '>=40K <50K'}
life_map = {1: '>=40 <50', 2: '>=50 <60', 3: '>=60 <70', 4: '>=70 <80', 5: '>=80 <90'}
alcohol_map = {1: '>=0.5 <5', 2: '>=5 <10', 3: '>=10 <15', 4: '>=15 <20', 5: '>=20 <25'}
# Bin each variable into its groups, labelled by group number
dummyData['income'] = pandas.cut(data.income, [100, 5000, 10000, 20000, 30000, 40000, 50000],
                                 labels=['1', '2', '3', '4', '5', '6'])
print(dummyData.head(10))
dummyData['life'] = pandas.cut(data.life, [40, 50, 60, 70, 80, 90],
                               labels=['1', '2', '3', '4', '5'])
print(dummyData.head(10))
dummyData['alcohol'] = pandas.cut(data.alcohol, [0.5, 5, 10, 15, 20, 25],
                                  labels=['1', '2', '3', '4', '5'])
print(dummyData.head(10))
c2 = dummyData['income'].value_counts(sort=False)
p2 = dummyData['income'].value_counts(sort=False, normalize=True)
c3 = dummyData['alcohol'].value_counts(sort=False)
p3 = dummyData['alcohol'].value_counts(sort=False, normalize=True)
c4 = dummyData['life'].value_counts(sort=False)
p4 = dummyData['life'].value_counts(sort=False, normalize=True)
print("****************************************************")
print("*************Absolute Frequency********************")
print("Income Per Person:")
print("Income-Freq")
print(c2)
print("Alcohol Consumption:")
print("Alcohol-Freq")
print(c3)
print("Life Expectancy:")
print("Life-Freq")
print(c4)
print("****************************************************")
print("*************Relative Frequency********************")
print("Income Per Person:")
print("Income-Freq")
print(p2)
print("Alcohol Consumption:")
print("Alcohol-Freq")
print(p3)
print("Life Expectancy:")
print("Life-Freq")
print(p4)