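I'm teaching myself some logistic regression using Python. I'm trying to apply the lessons from the walkthrough here to the small dataset in the Wikipedia entry here.
Something doesn't seem quite right. Wikipedia and the Excel Solver (verified using the method in this video) give an intercept of -4.0777 and a coefficient of 1.5046, but the code I built from the GitHub example outputs -0.924200 and 0.756024, respectively.
The code I'm trying to use is below. Is there anything obviously wrong?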
    import numpy as np
    import pandas as pd
    from patsy import dmatrices
    from sklearn.linear_model import LogisticRegression

    X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
         3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

    zipped = list(zip(X, y))
    df = pd.DataFrame(zipped, columns=['study_hrs', 'p_or_f'])

    y, X = dmatrices('p_or_f ~ study_hrs', df, return_type="dataframe")
    y = np.ravel(y)

    model = LogisticRegression()
    model = model.fit(X, y)
    print(pd.DataFrame(np.transpose(model.coef_), X.columns))

    >>>
                      0
    Intercept -0.924200
    study_hrs  0.756024
lejlot.. 11
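Just change the model creation line to

    model = LogisticRegression(C=100000, fit_intercept=False)

By default, sklearn fits a regularized LogisticRegression with regularization strength C=1 (small C means strong regularization, large C means weak regularization).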
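From the scikit-learn documentation: "This class implements regularized logistic regression using the liblinear library, newton-cg and lbfgs solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied)."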
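To make the effect of C concrete, here is a minimal sketch that refits the same data at a few values of C. It assumes the study hours are passed as a plain single-column array (no patsy), and the names `hours`, `passed` and `fits` are illustrative only. The exact numbers depend on the scikit-learn version and its default solver, so none are quoted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
                  2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Refit with increasingly weak regularization (larger C = weaker penalty).
fits = {}
for C in (1.0, 100.0, 1e6):
    model = LogisticRegression(C=C, max_iter=1000).fit(hours, passed)
    fits[C] = (model.intercept_[0], model.coef_[0, 0])
    print(f"C={C:g}: intercept={fits[C][0]:.4f}, coef={fits[C][1]:.4f}")
```

As C grows, the fit should approach the unregularized values reported by Wikipedia and the Excel Solver.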
So, to obtain their model, you should fit

    model = LogisticRegression(C=1000000)

which gives

    Intercept -2.038853  # this is actually half the intercept
    study_hrs  1.504643  # this is correct
Furthermore, part of the problem lies in the way you prepare your data with patsy. See this simplified, correct example:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
         3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

    # sklearn expects a 2-D feature array: one row per sample, one column per feature
    X = np.array([[x] for x in X])
    y = np.ravel(y)

    model = LogisticRegression(C=1000000.)
    model = model.fit(X, y)

    print('coef', model.coef_)
    print('intercept', model.intercept_)

which gives

    coef [[ 1.50464059]]
    intercept [-4.07769916]
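As a quick sanity check on what these two numbers mean, the sketch below plugs them into the logistic function by hand. The helper name `pass_probability` is hypothetical, and the coefficients are simply copied from the output above.

```python
import math

# Values copied from the fitted model's output above.
intercept = -4.07769916
coef = 1.50464059

def pass_probability(study_hrs):
    """Logistic model: p = 1 / (1 + exp(-(intercept + coef * study_hrs)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * study_hrs)))

p_two_hours = pass_probability(2.0)
# The probability crosses 0.5 where intercept + coef * study_hrs == 0.
break_even_hours = -intercept / coef

print(f"P(pass | 2 h) = {p_two_hours:.4f}")
print(f"hours at which P(pass) = 0.5: {break_even_hours:.4f}")
```

Two hours of study come out at roughly a one-in-four chance of passing, which is in line with the figure given in the Wikipedia example.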
What exactly is the problem? When you call dmatrices, by default it adds a column of ones (the bias) to your input data:

    import pandas as pd
    from patsy import dmatrices

    X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
         3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

    zipped = list(zip(X, y))
    df = pd.DataFrame(zipped, columns=['study_hrs', 'p_or_f'])

    y, X = dmatrices('p_or_f ~ study_hrs', df, return_type="dataframe")

    print(X)

This results in

        Intercept  study_hrs
    0           1       0.50
    1           1       0.75
    2           1       1.00
    3           1       1.25
    4           1       1.50
    5           1       1.75
    6           1       1.75
    7           1       2.00
    8           1       2.25
    9           1       2.50
    10          1       2.75
    11          1       3.00
    12          1       3.25
    13          1       3.50
    14          1       4.00
    15          1       4.25
    16          1       4.50
    17          1       4.75
    18          1       5.00
    19          1       5.50
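This is why the resulting bias is only half of the true one: scikit-learn also adds its own column of ones, so you now have two biases, and the optimal solution is to give each of them half of the weight that a single bias would get.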
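The split can be checked directly. The sketch below builds the column of ones by hand with NumPy instead of patsy and keeps sklearn's own intercept switched on. Passing solver="liblinear" is an assumption made here: that solver penalizes its own intercept the same way as any other coefficient, so the two bias terms are treated identically. Other solvers may divide the bias differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
                  2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Design matrix with an explicit column of ones, like the one dmatrices builds by default.
X_with_ones = np.column_stack([np.ones_like(hours), hours])

# fit_intercept is left at its default (True), so sklearn adds a second bias term.
model = LogisticRegression(C=1e6, solver="liblinear").fit(X_with_ones, passed)

ones_coef = model.coef_[0, 0]        # weight on the explicit column of ones
own_intercept = model.intercept_[0]  # sklearn's own intercept
total_bias = ones_coef + own_intercept

print("weight on ones column:", ones_coef)
print("sklearn intercept:    ", own_intercept)
print("sum of the two:       ", total_bias)
```

Neither bias term is meaningful on its own. Only their sum corresponds to the intercept of the model.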
So what can you do? Any one of the following:
- don't use patsy in this way
- forbid patsy from adding the bias
- tell sklearn not to add a bias
For example, telling sklearn not to add a bias:

    import numpy as np
    import pandas as pd
    from patsy import dmatrices
    from sklearn.linear_model import LogisticRegression

    X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
         3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

    zipped = list(zip(X, y))
    df = pd.DataFrame(zipped, columns=['study_hrs', 'p_or_f'])

    y, X = dmatrices('p_or_f ~ study_hrs', df, return_type="dataframe")
    y = np.ravel(y)

    # patsy already supplies the Intercept column, so sklearn must not add its own
    model = LogisticRegression(C=100000, fit_intercept=False)
    model = model.fit(X, y)
    print(pd.DataFrame(np.transpose(model.coef_), X.columns))

which gives

    Intercept -4.077571
    study_hrs  1.504597

as expected.
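The other route, forbidding patsy from adding the bias, can be sketched as follows. It relies on patsy's formula syntax, where appending `- 1` removes the implicit intercept column, and it leaves sklearn's own intercept enabled. The max_iter value is an arbitrary choice, not something from the original code.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "study_hrs": [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
                  2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5],
    "p_or_f": [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
})

# "- 1" tells patsy not to add its Intercept column to the design matrix.
y, X = dmatrices("p_or_f ~ study_hrs - 1", df, return_type="dataframe")
y = np.ravel(y)

# sklearn's default fit_intercept=True now supplies the only bias term.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

print(list(X.columns))
print("coef", model.coef_)
print("intercept", model.intercept_)
```

With this variant the intercept is reported in `model.intercept_` rather than as a row of `model.coef_`.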