In the GroupKFold source, random_state is set to None:
    def __init__(self, n_splits=3):
        super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                         random_state=None)
Hence, when it is run multiple times (code taken from here):
    import numpy as np
    from sklearn.model_selection import GroupKFold

    # Repeat the same split ten times to see whether it ever changes.
    for i in range(0, 10):
        X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
        y = np.array([1, 2, 3, 4])
        groups = np.array([0, 0, 2, 2])
        group_kfold = GroupKFold(n_splits=2)
        group_kfold.get_n_splits(X, y, groups)
        print(group_kfold)
        for train_index, test_index in group_kfold.split(X, y, groups):
            print("TRAIN:", train_index, "TEST:", test_index)
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            print(X_train, X_test, y_train, y_test)
            print("")  # blank line between folds
        print("")  # blank line between repetitions
Output:

    GroupKFold(n_splits=2)
    ('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
    (array([[1, 2], [3, 4]]), array([[5, 6], [7, 8]]), array([1, 2]), array([3, 4]))
    ('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
    (array([[5, 6], [7, 8]]), array([[1, 2], [3, 4]]), array([3, 4]), array([1, 2]))
    GroupKFold(n_splits=2)
    ('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
    (array([[1, 2], [3, 4]]), array([[5, 6], [7, 8]]), array([1, 2]), array([3, 4]))
    ('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
    (array([[5, 6], [7, 8]]), array([[1, 2], [3, 4]]), array([3, 4]), array([1, 2]))
and so on...

The splits are identical on every run.
How can I set random_state for GroupKFold so that I get a different (but reproducible) set of splits on each of several cross-validation trials?
I would like something like this:

    GroupKFold(n_splits=2, random_state=42)
    ('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
    ('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
    GroupKFold(n_splits=2, random_state=13)
    ('TRAIN:', array([0, 2]), 'TEST:', array([1, 3]))
    ('TRAIN:', array([1, 3]), 'TEST:', array([0, 2]))
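So far, one strategy seems to be to apply sklearn.utils.shuffle first, as suggested in this post. However, that only rearranges the elements within each fold; it does not give us new splits.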
    import sys

    import numpy as np
    from sklearn.model_selection import GroupKFold
    from sklearn.utils import shuffle

    random_state = int(sys.argv[1])  # seed passed on the command line

    X = np.arange(20).reshape((10, 2))
    y = np.arange(10)
    groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

    def cv(X, y, groups, random_state):
        # Shuffle the samples (together with their group labels) before splitting.
        X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
        cv_out = GroupKFold(n_splits=2)
        for train, test in cv_out.split(X_s, y_s, groups_s):
            print("---")
            print(X_s[test])
            print(y_s[test])
            print("test groups", groups_s[test])
            print("train groups", groups_s[train])

    print("***")
    cv(X, y, groups, random_state)
Output:

    >python sshuf.py 32
    ***
    ---
    [[ 2  3]
     [ 4  5]
     [ 0  1]
     [ 8  9]
     [12 13]]
    [1 2 0 4 6]
    test groups [0 0 0 2 4]
    train groups [7 6 1 3 5]
    ---
    [[18 19]
     [16 17]
     [ 6  7]
     [10 11]
     [14 15]]
    [9 8 3 5 7]
    test groups [7 6 1 3 5]
    train groups [0 0 0 2 4]

    >python sshuf.py 234
    ***
    ---
    [[12 13]
     [ 4  5]
     [ 0  1]
     [ 2  3]
     [ 8  9]]
    [6 2 0 1 4]
    test groups [4 0 0 0 2]
    train groups [7 3 1 5 6]
    ---
    [[18 19]
     [10 11]
     [ 6  7]
     [14 15]
     [16 17]]
    [9 5 3 7 8]
    test groups [7 3 1 5 6]
    train groups [4 0 0 0 2]
Answer by joeln (score 8):
KFold is only randomized if shuffle=True. Some datasets should not be shuffled.

GroupKFold is not randomized at all. Hence the random_state=None.

GroupShuffleSplit may be closer to what you're looking for.
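As a rough illustration of that suggestion, the sketch below uses GroupShuffleSplit with a fixed random_state. The data mirrors the toy arrays from the question; the helper name `group_splits` and the choice of test_size=0.5 are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data shaped like the question's example: 10 samples, 8 groups.
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])


def group_splits(random_state, n_splits=2, test_size=0.5):
    """Return [(train_groups, test_groups), ...] for one seed (hypothetical helper)."""
    gss = GroupShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=random_state)
    return [(set(groups[train].tolist()), set(groups[test].tolist()))
            for train, test in gss.split(X, y, groups)]


for seed in (42, 13):
    for train_groups, test_groups in group_splits(seed):
        print("seed", seed,
              "train groups", sorted(train_groups),
              "test groups", sorted(test_groups))
```

Re-running with the same seed reproduces the same splits, and no group appears on both sides of a split. Note that the test sets of successive splits are drawn independently, so they may overlap and need not cover all the data.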
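A comparison of the group-based splitters:

In GroupKFold, the test sets form a complete partition of all the data.

LeavePGroupsOut leaves out every possible subset of P groups, combinatorially; for P > 1 the test sets overlap. Since this means "n_groups choose P" splits altogether, you usually want a small P, and most often you want LeaveOneGroupOut, which produces the same test sets as GroupKFold with n_splits equal to the number of groups.

GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.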
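The differences between these splitters can be checked on a small example. The sketch below uses made-up data (4 groups of 2 samples each); the helper name `collect_test_sets` is hypothetical and exists only for this illustration.

```python
from math import comb

import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut

# Toy data (made up for illustration): 4 groups of 2 samples each.
X = np.arange(16).reshape((8, 2))
y = np.arange(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
n_groups = len(np.unique(groups))


def collect_test_sets(splitter):
    """Collect the test indices of every split as a sorted list of tuples."""
    return sorted(tuple(test.tolist()) for _, test in splitter.split(X, y, groups))


gkf_tests = collect_test_sets(GroupKFold(n_splits=n_groups))
logo_tests = collect_test_sets(LeaveOneGroupOut())
lpgo_tests = collect_test_sets(LeavePGroupsOut(n_groups=2))

print("GroupKFold (n_splits = n_groups):", gkf_tests)
print("LeaveOneGroupOut:                ", logo_tests)
print("LeavePGroupsOut (P=2):           ", lpgo_tests)
print("number of P=2 splits:", len(lpgo_tests), "expected:", comb(n_groups, 2))
```

On this data, GroupKFold and LeaveOneGroupOut give the same non-overlapping test sets, while LeavePGroupsOut with P=2 gives one overlapping test set per pair of groups.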
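As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm that is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.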
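If a full partition of the data by groups is still wanted (as with GroupKFold), but one that varies with a seed, one possible workaround is to run a shuffled KFold over the unique group labels and map the folds back to sample indices. This is not the method from the answer above, only a minimal sketch; `shuffled_group_kfold` is a hypothetical helper, not a scikit-learn API. Unlike GroupKFold, it balances folds by number of groups rather than number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold


def shuffled_group_kfold(groups, n_splits, random_state):
    """Yield (train_idx, test_idx) pairs whose test sets partition the groups.

    Hypothetical helper: folds are built by a seeded, shuffled KFold over the
    unique group labels, so every sample of a group lands in the same fold.
    """
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for _, test_g in kf.split(unique_groups):
        # Mark every sample whose group was assigned to this test fold.
        test_mask = np.isin(groups, unique_groups[test_g])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for seed in (42, 13):
    for train_idx, test_idx in shuffled_group_kfold(groups, n_splits=2,
                                                    random_state=seed):
        print("seed", seed,
              "train groups", groups[train_idx],
              "test groups", groups[test_idx])
```

The same seed reproduces the same folds, and within one seed the test sets cover every sample exactly once.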