Python Project With Source Code for Data Analysts – NumPy, Pandas, Matplotlib, Seaborn Libraries

Friends, this is a slightly challenging project for data analysts because it uses several Python libraries, including NumPy, Pandas, Matplotlib, and Seaborn. I have also provided an Excel file so that you can work with a real file.

Download the file Here – Housing.xlsx

1. What is the average median income of the data set? Check the distribution of the data using appropriate plots and explain the distribution shown in the plot.

Code – 
import pandas as pd
import matplotlib.pyplot as plt

# Loading the Excel file into a Pandas DataFrame
housing = pd.read_excel('housing.xlsx')

# Extract the column of interest
income = housing['median_income']

# Calculate the average (mean) of the median income column
avg_median_income = income.mean()
print(avg_median_income)

# Create a histogram
plt.hist(income, bins=30, alpha=0.5, label='Median income')

plt.xlabel('Median Income')
plt.ylabel('Frequency')
plt.title('Distribution of Median Incomes')
plt.legend(loc='upper right')

# Show the plot
plt.show()
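Explanation – This is a positively skewed (right-skewed) distribution: most values sit at the lower end and a long tail extends toward the high incomes. In a distribution like this the mean is typically greater than the median, and the median is typically greater than the mode. [Mean > Median > Mode]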
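One way to back up the visual impression with numbers is to compare the mean with the median and to compute the sample skewness. The sketch below uses a small made-up sample (an assumption, not the values in Housing.xlsx); with the real file the same calls would be applied to housing['median_income'].

```python
import pandas as pd

# Hypothetical sample standing in for housing['median_income']
income = pd.Series([1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0, 5.0, 8.0, 15.0])

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()  # a positive value indicates a right-skewed sample

print("mean:", mean_income)
print("median:", median_income)
print("skewness:", skewness)
```

A mean above the median together with a positive skewness value is consistent with a right-skewed histogram.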

2. Draw an appropriate plot to see the distribution of housing_median_age and explain your observations.

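Code –
# Seaborn is needed for histplot; pandas and pyplot were imported in question 1
import seaborn as sns

data = pd.read_excel('housing.xlsx')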
sns.histplot(data=data, x="housing_median_age", bins=20, kde=True)
plt.title("Housing Median Age Distribution")
plt.xlabel("Housing Median Age")
plt.ylabel("Count")
plt.show()
Explanation – The majority of the houses in the dataset have a median age between 15 and 37 years.
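A statement like "the majority fall between 15 and 37" can be checked directly instead of read off the plot. The following is a minimal sketch on made-up ages (an assumption, not the real column); on the real data the Series would be data['housing_median_age'].

```python
import pandas as pd

# Hypothetical ages standing in for data['housing_median_age']
ages = pd.Series([5, 16, 20, 29, 35, 37, 45, 52])

# between() includes both end points by default;
# the mean of a boolean Series is the share of True values
share_in_range = ages.between(15, 37).mean()

# The 25th and 75th percentiles bound the middle half of the values
quartiles = ages.quantile([0.25, 0.75])

print("share between 15 and 37:", share_in_range)
print(quartiles)
```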

3. Show, with the help of a visualization, how median_income and median_house_value are related.

Code –
sns.scatterplot(data=data, x="median_income", y="median_house_value")
plt.title("Median Income vs Median House Value")
plt.xlabel("Median Income")
plt.ylabel("Median House Value")
plt.show()
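Explanation – The plot shows that as median_income increases, median_house_value tends to increase as well, so the two are positively related. The points are scattered rather than lying on a straight line, so this is a trend and not strict proportionality.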
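The strength of a relationship like this can be summarised with a Pearson correlation coefficient. Below is a small sketch on invented numbers (an assumption, not taken from the file); on the real data the same call would be data['median_income'].corr(data['median_house_value']).

```python
import pandas as pd

# Hypothetical rows standing in for the two real columns
sample = pd.DataFrame({
    "median_income": [1.5, 2.5, 3.5, 4.5, 6.0],
    "median_house_value": [90000, 150000, 180000, 260000, 400000],
})

# Series.corr() computes the Pearson correlation by default
r = sample["median_income"].corr(sample["median_house_value"])
print("Pearson r:", r)
```

A value close to 1 indicates a strong positive linear relationship, a value near 0 indicates little linear relationship.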

4. Create a data set by deleting the corresponding examples from the data set for which total_bedrooms are not available.

Code – 
# Load the original dataset
housing = pd.read_excel('housing.xlsx')


# Drop the rows where total_bedrooms is not available
data_with_bedrooms = housing.dropna(subset=['total_bedrooms'])
print(data_with_bedrooms)
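It can be useful to count the missing values before and after dropping rows, to confirm that the expected number of rows was removed. The sketch below uses a tiny invented frame (an assumption) in place of the Excel file.

```python
import pandas as pd

# Hypothetical frame with two missing total_bedrooms values
sample = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274, 1627],
    "total_bedrooms": [129.0, None, 190.0, None, 280.0],
})

missing_before = sample["total_bedrooms"].isna().sum()
cleaned = sample.dropna(subset=["total_bedrooms"])
missing_after = cleaned["total_bedrooms"].isna().sum()

print("missing before:", missing_before)
print("rows before:", len(sample), "rows after:", len(cleaned))
print("missing after:", missing_after)
```

dropna() returns a new DataFrame, so the original frame keeps all of its rows.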

5. Create a data set by filling the missing data with the mean value of the total_bedrooms in the original data set.

Code – 
# Load the original dataset
housing = pd.read_excel('housing.xlsx')


# Calculate the mean value of total_bedrooms
mean_bedrooms = housing['total_bedrooms'].mean()
print(mean_bedrooms)


# Fill missing values in total_bedrooms with the mean value
data_with_mean = housing.fillna(value={'total_bedrooms': mean_bedrooms})
print(data_with_mean)
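To see what the mean fill does, here is a minimal sketch on an invented column (an assumption, not the real data). It relies on the fact that mean() skips missing values by default, so the fill value is the mean of the available entries only.

```python
import pandas as pd

# Hypothetical column with two gaps
sample = pd.DataFrame({"total_bedrooms": [100.0, None, 300.0, None, 200.0]})

# mean() ignores NaN by default, so this is (100 + 300 + 200) / 3
mean_bedrooms = sample["total_bedrooms"].mean()

# fillna() returns a new frame; sample itself is left unchanged
filled = sample.fillna(value={"total_bedrooms": mean_bedrooms})

print(mean_bedrooms)
print(filled["total_bedrooms"].tolist())
```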

6. Write a programming construct (create a user defined function) to calculate the median value of the data set wherever required.

Code – 
def calculate_median(data):
    sorted_data = sorted(data)
    n = len(data)
    if n % 2 == 0:
        # If there are an even number of values, take the average of the middle two
        middle = n // 2
        median = (sorted_data[middle-1] + sorted_data[middle]) / 2
    else:
        # If there are an odd number of values, take the middle value
        middle = n // 2
        median = sorted_data[middle]
    return median


data = [11, 12, 13, 14, 15]
print(calculate_median(data))  # prints 13
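A hand-written median is easy to cross-check against the standard library. The sketch below repeats a compact version of the function so that it runs on its own, compares it with statistics.median, and adds a hypothetical helper, median_skipping_missing (not part of the original code), that removes NaN values first. This matters because sorting a list that contains NaN is unreliable: every comparison with NaN is False.

```python
import math
import statistics


def calculate_median(data):
    """Return the median of a sequence of numbers."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    middle = n // 2
    if n % 2 == 0:
        # Even count: average the two middle values
        return (sorted_data[middle - 1] + sorted_data[middle]) / 2
    # Odd count: take the middle value
    return sorted_data[middle]


def median_skipping_missing(values):
    """Hypothetical helper: drop NaN entries, then take the median."""
    clean = [v for v in values if not (isinstance(v, float) and math.isnan(v))]
    return calculate_median(clean)


odd = [11, 12, 13, 14, 15]
even = [11, 12, 13, 14]
with_gap = [3.0, float("nan"), 1.0, 2.0]

print(calculate_median(odd), statistics.median(odd))
print(calculate_median(even), statistics.median(even))
print(median_skipping_missing(with_gap))
```

For a DataFrame column, one option is to pass housing['total_bedrooms'].dropna() to the function so that missing values never reach the sort.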

7. Plot latitude versus longitude and explain your observations.

Code – 
# Load the dataset
housing = pd.read_excel('housing.xlsx')

# Create the scatter plot
plt.scatter(housing['longitude'], housing['latitude'], s=1)

# Add labels and title
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Latitude vs. Longitude')

# Show the plot
plt.show()
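Explanation – The plot shows the location of each data point on a 2D plane, with longitude on the x-axis and latitude on the y-axis. Together the points trace out the geographic area covered by the data set. Longitude and latitude are simply coordinates, so neither one depends on the other.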

8. Create a data set for which the ocean_proximity is ‘Near ocean’.

Code – 
# Load the original dataset
housing = pd.read_excel('housing.xlsx')

# Keep only the rows where ocean_proximity is 'NEAR OCEAN' (the label is matched in upper case)
near_ocean_data = housing[housing['ocean_proximity'] == 'NEAR OCEAN']
print(near_ocean_data)
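Because the filter is an exact string match, it helps to look at the labels that actually occur before filtering. The sketch below uses an invented frame; the category labels in it are assumptions for illustration and may differ from those in the file. It also shows a case-insensitive variant of the filter.

```python
import pandas as pd

# Hypothetical rows; the labels here are illustrative only
sample = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR OCEAN", "INLAND", "NEAR OCEAN", "<1H OCEAN"],
    "median_income": [8.3, 3.1, 2.0, 4.5, 5.6],
})

# List each label with its number of rows
print(sample["ocean_proximity"].value_counts())

# Upper-casing the column makes the comparison independent of letter case
is_near_ocean = sample["ocean_proximity"].str.upper() == "NEAR OCEAN"
near_ocean_sample = sample[is_near_ocean]
print(near_ocean_sample)
```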

9. Find the mean and median of the median income for the data set created in question 8.

Code –
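# Recreate the dataset from question 8 (it was never saved to a file, so filter the original again)
housing = pd.read_excel('housing.xlsx')
near_ocean_data = housing[housing['ocean_proximity'] == 'NEAR OCEAN']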

# Calculate the mean of the median income column
median_income_mean = near_ocean_data['median_income'].mean()
print("Mean of median income:", median_income_mean)

# Calculate the median of the median income column
median_income_median = near_ocean_data['median_income'].median()
print("Median of median income:", median_income_median)
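Both statistics can also be requested in a single call with agg(). This is a small sketch on invented values (an assumption); with the real data the Series would be near_ocean_data['median_income'].

```python
import pandas as pd

# Hypothetical subset standing in for near_ocean_data
near_ocean_sample = pd.DataFrame({"median_income": [2.0, 3.0, 7.0]})

# agg() with a list of function names returns a Series indexed by those names
summary = near_ocean_sample["median_income"].agg(["mean", "median"])
print(summary)
```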

10. Please create a new column named total_bedroom_size. If the total bedrooms is 10 or less, it should be labelled small. If the total bedrooms is 11 or more but less than 1000, it should be medium; otherwise it should be considered large.

Code – 
# Load the original dataset (named housing because the code below uses that name)
housing = pd.read_excel('housing.xlsx')

# Define a function to categorize the total bedrooms
def categorize_total_bedrooms(num):
    if num <= 10:
        return 'small'
    elif num < 1000:
        return 'medium'
    else:
        return 'large'

# Create a new column with total bedroom size categories
housing['total_bedroom_size'] = housing['total_bedrooms'].apply(categorize_total_bedrooms)
print(housing)
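An alternative to apply() with a custom function is pd.cut, which bins a numeric column in one step. The sketch below is one possible way to express the boundaries from question 10, assuming bedroom counts are whole numbers; it runs on an invented column rather than the real file. Note that pd.cut leaves missing values as NaN, whereas a function that falls through to an else branch labels them large.

```python
import pandas as pd

# Hypothetical counts, including the boundary values and one missing entry
sample = pd.DataFrame({"total_bedrooms": [3, 10, 11, 999, 1000, None]})

# With right=False each bin includes its left edge and excludes its right edge:
# [-inf, 11) -> small, [11, 1000) -> medium, [1000, inf) -> large
bins = [float("-inf"), 11, 1000, float("inf")]
labels = ["small", "medium", "large"]

sample["total_bedroom_size"] = pd.cut(
    sample["total_bedrooms"], bins=bins, labels=labels, right=False
)
print(sample)
```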
