My name is Dipika Pawar, and this blog is part of the Data Management and Visualization course on Coursera. Assignments are submitted in the form of blog posts, so I chose Medium as my medium of submission. Click here for the previous posts: Week1 & Week2.

Here is my codebook:

Research Question:

Is there any relation between life expectancy and alcohol consumption?

Is there any relation between Income and alcohol consumption?

Assignment 3:

Assignment 3 covers data management for the chosen topic: we have to make and implement data-management decisions for our variables. Data management is the process of storing, organizing and maintaining data; keeping that data secure is also an important part of it.

Following this week's videos, the steps are:

Setting aside missing data → Coding and validating the data → Creating secondary variables → Grouping values within individual variables

This blog covers:

Removing rows with missing data

Grouping the values of income, alcohol consumption and life expectancy

Showing the frequency distributions before and after grouping

Data Dictionary:

Code:

Step 1: Importing Libraries

Step 2: Read the required columns of the dataset and rename them

Output: what the data variable contains
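To illustrate this step in isolation, here is a minimal sketch that reads the same four columns and renames them. The CSV text is a made-up stand-in for CodebookGap.csv (the values are illustrative, not the real data):

```python
import io
import pandas

# A tiny stand-in for CodebookGap.csv -- values are illustrative only
csv_text = """country,incomeperperson,alcconsumption,lifeexpectancy
Afghanistan,,0.03,48.673
Albania,1914.99,7.29,76.918
"""
data = pandas.read_csv(io.StringIO(csv_text),
                       usecols=['country', 'incomeperperson',
                                'alcconsumption', 'lifeexpectancy'])
# Rename to the short names used in the rest of the code
data.columns = ['country', 'income', 'alcohol', 'life']
print(data.columns.tolist())  # ['country', 'income', 'alcohol', 'life']
```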

Step 3: Convert the columns read from the .csv file to numeric types
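The conversion relies on pandas.to_numeric with errors='coerce', which turns anything that cannot be parsed as a number into NaN. A small self-contained illustration:

```python
import pandas

# errors='coerce' replaces unparseable strings with NaN
s = pandas.Series(['1914.99', 'n/a', '7.29'])
s = pandas.to_numeric(s, errors='coerce')
print(s.isnull().sum())  # 1
```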

Step 4: Drop the missing data rows
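Dropping incomplete rows uses DataFrame.dropna with how='any', which removes a row if any of its columns is NaN. A minimal sketch with made-up values:

```python
import numpy
import pandas

# how='any' drops a row when at least one column is NaN
df = pandas.DataFrame({'income': [1914.99, numpy.nan],
                       'life': [76.918, 48.673]})
df = df.dropna(axis=0, how='any')
print(len(df))  # 1
```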

Step 5: Display absolute and relative frequency

Output:
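Both frequency tables come from the same method: value_counts gives absolute counts, and normalize=True turns them into relative shares. A tiny illustration with made-up values:

```python
import pandas

s = pandas.Series([5.0, 5.0, 7.0])
counts = s.value_counts(sort=False)                  # absolute frequency
shares = s.value_counts(sort=False, normalize=True)  # relative frequency
print(counts[5.0])          # 2
print(round(shares[7.0], 3))  # 0.333
```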

Step 6: Group each variable's values to make the research questions easier to analyze

Output: each variable's values replaced by their group numbers
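The grouping is done with pandas.cut, which bins continuous values into labelled intervals that are open on the left and closed on the right. A minimal sketch using the same life-expectancy bins as the full listing (the input values are illustrative):

```python
import pandas

life = pandas.Series([48.673, 76.918, 81.1])
# Bins: (40, 50], (50, 60], (60, 70], (70, 80], (80, 90]
groups = pandas.cut(life, [40, 50, 60, 70, 80, 90],
                    labels=['1', '2', '3', '4', '5'])
print(groups.tolist())  # ['1', '4', '5']
```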

Step 7: Frequency Distribution of new grouped data
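Because pandas.cut returns a categorical column, value_counts on the grouped data reports every group, including empty ones. A small illustration with made-up values:

```python
import pandas

groups = pandas.cut(pandas.Series([48.673, 76.918, 73.1]),
                    [40, 50, 60, 70, 80, 90],
                    labels=['1', '2', '3', '4', '5'])
counts = groups.value_counts(sort=False)
print(counts['4'])  # 2
print(counts['2'])  # 0 -- empty groups still appear
```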

=======================================

Whole code:

# Import the libraries
import pandas
from collections import OrderedDict

data = pandas.read_csv('CodebookGap.csv', low_memory=False,
                       skip_blank_lines=True,
                       usecols=['country', 'incomeperperson',
                                'alcconsumption', 'lifeexpectancy'])
data.columns = ['country', 'income', 'alcohol', 'life']

'''
# Variable descriptions
alcohol = "2008 alcohol consumption per adult (litres, age 15+)"
income = "2010 Gross Domestic Product per capita in constant 2000 US$"
life = "2011 life expectancy at birth (years)"
'''

data.info()

# Convert the three measures to numeric; unparseable values become NaN
for dt in ('income', 'alcohol', 'life'):
    data[dt] = pandas.to_numeric(data[dt], errors='coerce')
data.info()

# Show the rows containing missing values, then drop them
nullLabels = data[data.country.isnull() | data.income.isnull() |
                  data.alcohol.isnull() | data.life.isnull()]
print(nullLabels)
data = data.dropna(axis=0, how='any')
print(data.info())

# Absolute and relative frequencies of the raw values
c2 = data['income'].value_counts(sort=False)
p2 = data['income'].value_counts(sort=False, normalize=True)
c3 = data['alcohol'].value_counts(sort=False)
p3 = data['alcohol'].value_counts(sort=False, normalize=True)
c4 = data['life'].value_counts(sort=False)
p4 = data['life'].value_counts(sort=False, normalize=True)

print("****************************************************")
print("*************Absolute Frequency********************")
print("Income Per Person:")
print("Income Freq")
print(c2)
print("Alcohol Consumption:")
print("Alcohol Freq")
print(c3)
print("Life Expectancy:")
print("Life Freq")
print(c4)
print("****************************************************")
print("*************Relative Frequency********************")
print("Income Per Person:")
print("Income Freq")
print(p2)
print("Alcohol Consumption:")
print("Alcohol Freq")
print(p3)
print("Life Expectancy:")
print("Life Freq")
print(p4)

# Minimum and maximum of each variable, used to choose the group boundaries
minMax = OrderedDict()
dict1 = OrderedDict()
dict1['min'] = data.life.min()
dict1['max'] = data.life.max()
minMax['life'] = dict1
dict2 = OrderedDict()
dict2['min'] = data.income.min()
dict2['max'] = data.income.max()
minMax['income'] = dict2
dict3 = OrderedDict()
dict3['min'] = data.alcohol.min()
dict3['max'] = data.alcohol.max()
minMax['alcohol'] = dict3
df = pandas.DataFrame([minMax['income'], minMax['life'], minMax['alcohol']],
                      index=['Income', 'Life', 'Alcohol'])
print(df.sort_index(axis=1, ascending=False))

dummyData = data.copy()

# Maps from group number to value range
income_map = {1: '>=100 <5K', 2: '>=5K <10K', 3: '>=10K <20K',
              4: '>=20K <30K', 5: '>=30K <40K', 6: '>=40K <50K'}
life_map = {1: '>=40 <50', 2: '>=50 <60', 3: '>=60 <70',
            4: '>=70 <80', 5: '>=80 <90'}
alcohol_map = {1: '>=0.5 <5', 2: '>=5 <10', 3: '>=10 <15',
               4: '>=15 <20', 5: '>=20 <25'}

# Replace each variable's values with its group number
dummyData['income'] = pandas.cut(data.income,
                                 [100, 5000, 10000, 20000, 30000, 40000, 50000],
                                 labels=['1', '2', '3', '4', '5', '6'])
print(dummyData.head(10))
dummyData['life'] = pandas.cut(data.life, [40, 50, 60, 70, 80, 90],
                               labels=['1', '2', '3', '4', '5'])
print(dummyData.head(10))
dummyData['alcohol'] = pandas.cut(data.alcohol, [0.5, 5, 10, 15, 20, 25],
                                  labels=['1', '2', '3', '4', '5'])
print(dummyData.head(10))

# Absolute and relative frequencies of the grouped data
c2 = dummyData['income'].value_counts(sort=False)
p2 = dummyData['income'].value_counts(sort=False, normalize=True)
c3 = dummyData['alcohol'].value_counts(sort=False)
p3 = dummyData['alcohol'].value_counts(sort=False, normalize=True)
c4 = dummyData['life'].value_counts(sort=False)
p4 = dummyData['life'].value_counts(sort=False, normalize=True)

print("****************************************************")
print("*************Absolute Frequency********************")
print("Income Per Person:")
print("Income-Freq")
print(c2)
print("Alcohol Consumption:")
print("Alcohol-Freq")
print(c3)
print("Life Expectancy:")
print("Life-Freq")
print(c4)
print("****************************************************")
print("*************Relative Frequency********************")
print("Income Per Person:")
print("Income-Freq")
print(p2)
print("Alcohol Consumption:")
print("Alcohol-Freq")
print(p3)
print("Life Expectancy:")
print("Life-Freq")
print(p4)


Dipika Parbatsinh Pawar

Student at Ahmedabad University| Data Science Enthusiast| Backend Developer