Analyzing the Titanic dataset with two models: Random Forest and Logistic Regression
Random Forest

Code implementation:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the Kaggle Titanic training and test sets.
train_data = pd.read_csv("Input/titanic/train.csv")
test_data = pd.read_csv("Input/titanic/test.csv")

# Survival rate by sex in the training set (exploratory only, not used by the model).
women = train_data.loc[train_data.Sex == 'female']["Survived"]
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men) / len(men)
rate_women = sum(women) / len(women)

# Target and features; get_dummies one-hot encodes the string column "Sex".
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
print(X.head(10))
X_test = pd.get_dummies(test_data[features])

# n_estimators: number of decision trees in the forest (default 100). More trees
#   usually make the model more stable and accurate, at a higher computational cost.
# max_depth: maximum depth of each tree (default None). Limiting the depth helps
#   prevent overfitting, but a value that is too small may miss complex relationships.
# random_state: seed controlling the randomness (default None). Fixing it makes
#   repeated runs give the same result, so the output is reproducible.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Write the predictions in the Kaggle submission format.
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission_randomforest.csv', index=False)
print("Your submission was successfully saved!")
```
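The `pd.get_dummies` call is what turns the string column `Sex` into numeric columns the model can use. A minimal sketch of that step follows, on a hypothetical three-row frame (`toy` is made up for illustration and is not taken from `train.csv`). The `reindex` step is an optional safeguard, not part of the original script; it only matters if the training and test sets ever contain different categories.

```python
import pandas as pd

# Hypothetical toy rows (not from train.csv) with the same four feature columns.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Numeric columns pass through unchanged; the string column "Sex" is replaced
# by one indicator column per category (Sex_female, Sex_male).
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Encoding a subset that only contains "male" yields no Sex_female column.
toy_test = pd.get_dummies(toy.iloc[:1])

# Reindexing to the training columns restores the missing column, filled with 0,
# so both matrices have identical columns in identical order.
aligned = toy_test.reindex(columns=encoded.columns, fill_value=0)
print(list(aligned.columns))
```

With the four features used here, both Kaggle files contain `male` and `female`, so the original script works without the alignment step.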
Logistic Regression (softmax)

The difference between Softmax Regression and Logistic Regression
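Logistic regression is a binary classification model, whereas Softmax regression is a multi-class classification model. When the number of classes is 2, Softmax regression reduces to logistic regression.
Softmax regression and logistic regression are therefore closely related, but they are not identical.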
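The two-class reduction can be checked with a short derivation. The symbols here are assumptions introduced for illustration: $x$ is the feature vector, $w_0$ and $w_1$ are the per-class weight vectors of a two-class Softmax model, and $\sigma$ is the sigmoid function.

```latex
\begin{aligned}
P(y = 1 \mid x)
  &= \frac{e^{w_1^\top x}}{e^{w_0^\top x} + e^{w_1^\top x}} \\
  &= \frac{1}{1 + e^{-(w_1 - w_0)^\top x}} \\
  &= \sigma\!\left(w^\top x\right), \qquad w = w_1 - w_0 .
\end{aligned}
```

Only the difference $w_1 - w_0$ matters, which is why a two-class Softmax model carries no more information than a logistic regression with a single weight vector.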
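Logistic regression is a linear model for binary classification. It applies the sigmoid function to a linear combination of the input features, which yields a probability between 0 and 1, and then assigns a class by comparing that probability with a chosen threshold. It can be viewed as the two-class special case of Softmax regression.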
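The sigmoid-plus-threshold step can be sketched in a few lines of plain Python. The weights, bias, and input below are hypothetical values chosen for illustration, and `classify` is a made-up helper name, not a scikit-learn function.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(x, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias  # linear score
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical parameters and input, for illustration only.
weights = [-1.0, 2.0]
bias = 0.5
p, label = classify([1.0, 0.5], weights, bias)
print(p, label)

# Raising the threshold can change the label even though the probability is unchanged.
p_strict, label_strict = classify([1.0, 0.5], weights, bias, threshold=0.7)
print(p_strict, label_strict)
```

A threshold of 0.5 corresponds to what `LogisticRegression.predict` does for two classes; a different threshold can be applied to the output of `predict_proba` instead.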
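Softmax regression, in contrast, is a model for multi-class problems. It applies a linear transformation to the input sample and then uses the Softmax function to convert the resulting scores into a vector of per-class probabilities. The Softmax function normalizes these probabilities so that they sum to 1, and the sample is assigned to the class with the highest probability.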
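The same pipeline (linear scores, Softmax normalization, pick the largest probability) can be sketched in plain Python. The weight rows, biases, and input are hypothetical, and `predict` is an illustrative helper rather than a library call.

```python
import math

def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, weight_rows, biases):
    """Return (class probabilities, index of the most probable class)."""
    # One linear score per class: w_k . x + b_k
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weight_rows, biases)]
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs, label

# Hypothetical three-class model with two input features.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
probs, label = predict([2.0, 1.0], W, b)
print(probs, label)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged and avoids overflow for large scores.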
Code implementation

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the Kaggle Titanic training and test sets.
train_data = pd.read_csv("Input/titanic/train.csv")
test_data = pd.read_csv("Input/titanic/test.csv")

# Survival rate by sex in the training set (exploratory only, not used by the model).
women = train_data.loc[train_data.Sex == 'female']["Survived"]
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men) / len(men)
rate_women = sum(women) / len(women)

# Same target and features as the Random Forest version.
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Logistic regression with scikit-learn's default settings.
model = LogisticRegression()
model.fit(X, y)
predictions = model.predict(X_test)

# Write the predictions in the Kaggle submission format.
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission_logic.csv', index=False)
print("Your submission was successfully saved!")
```