Pandas
import pandas as pd
input_file = "../somedir/somefile.csv"
data = pd.read_csv(input_file)
data.describe()
The results show 8 numbers for each numeric column in your original dataset.
* The first number, the count, shows how many rows have non-missing values.
* The second value is the mean, which is the average.
* Under that, std is the standard deviation, which measures how numerically spread out the values are.
* min is the smallest value.
* 25% is the 25th percentile: 25% of the values are lower than this number. In statistics, a percentile is the value below which a given percent of the values fall.
* 50% means 50% of the values are lower than this number (this is the median).
* 75% means 75% of the values are lower than this number.
* max is the largest value.
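To make the eight rows concrete, here is a minimal sketch on a tiny made-up DataFrame. The `toy` frame and its values are assumptions for illustration only; they are not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data, not from the original CSV; None marks a missing value.
toy = pd.DataFrame({"Price": [100.0, 200.0, 300.0, 400.0, None]})

summary = toy.describe()

# The index of the summary holds the eight statistics described above.
print(list(summary.index))

# count ignores the missing row, so it reports 4 values rather than 5.
print(summary.loc["count", "Price"])

# mean is the average of the four non-missing values.
print(summary.loc["mean", "Price"])

# The 50% row matches the median of the column.
print(summary.loc["50%", "Price"], toy["Price"].median())
```

Because count only counts non-missing values, comparing it with the total number of rows is a quick way to spot columns with gaps.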
To display all the columns in the dataset:
data.columns
dropna (drop not available) drops rows that contain missing values:
data = data.dropna(axis=0)
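As a rough illustration of what `dropna(axis=0)` does, the sketch below uses a small made-up frame. The `toy` data is hypothetical and only borrows column names for readability.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
toy = pd.DataFrame({
    "Rooms": [2, 3, None, 4],
    "Price": [500000.0, None, 650000.0, 800000.0],
})

# axis=0 drops every row that contains at least one missing value.
cleaned = toy.dropna(axis=0)

# Compare the row counts before and after.
print(len(toy), len(cleaned))

# Confirm that no missing values remain.
print(cleaned.isna().sum().sum())
```

Note that a row is removed if any of its columns is missing, so on a dataset with many sparse columns this can discard a large share of the rows.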
Select the prediction target (we are going to predict Price). By convention, the prediction target is called y.
y = data.Price
Choose the Features. By convention, this data is called X.
feature = data_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = data[feature]
# Verify the features
X.describe()
# To display the top 5 rows
X.head()
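The target and feature selection above can be tried on a small stand-in frame. This is only a sketch: the `toy` data is made up, and it uses two of the feature columns rather than the full list.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1, 2, 2],
    "Price": [500000, 650000, 800000],
})

# Selecting a single column gives a Series (the target).
y = toy.Price

# Selecting with a list of column names gives a DataFrame (the features).
X = toy[["Rooms", "Bathroom"]]

print(type(y).__name__, type(X).__name__)
print(X.shape, y.shape)
```

Dot notation (`toy.Price`) and bracket notation (`toy["Price"]`) return the same column; bracket notation is the safer choice when a column name contains spaces or clashes with a DataFrame method.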