Statistical Learning for Data Processing (scikit-learn Tutorial)

Published: 2020-12-25 23:30:06 | Category: Big Data | Source: compiled from the web

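Summary: Introduction to Data Mining in Practice (WeChat public account: datadw). Scikit-learn is a Python module that integrates classic machine learning algorithms and is tightly coupled with the Python scientific computing libraries (NumPy, SciPy, matplotlib). I. Statistical learning: settings and estimator objects in scikit-learn. (1) Datasets: scikit-learn, from two-dimensional arrays ...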

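Divisive: a top-down approach. All observations start in a single cluster, which is split iteratively as one moves down the hierarchy. When a large number of clusters is expected, this approach is both slow (because all observations start as one cluster that is split recursively) and statistically ill-posed.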

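Connectivity-constrained clustering
With agglomerative clustering, a connectivity graph can be used to specify which samples are allowed to be clustered together. Graphs in scikit-learn are represented by their adjacency matrix, which is usually a sparse matrix. This is useful, for example, for retrieving connected regions (sometimes also called connected components) when clustering an image: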

import time

import numpy as np
from scipy.datasets import face  # SciPy >= 1.10; fetches the image with the pooch package
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering

# Generate data
# Note: sp.misc.lena() has been removed from SciPy. The grayscale "face"
# sample image is used here as a stand-in for the original Lena image.
image = face(gray=True).astype(float)
# Downsample the image by a factor of 4
image = image[::2, ::2] + image[1::2, ::2] + image[::2, 1::2] + image[1::2, 1::2]
X = np.reshape(image, (-1, 1))

# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*image.shape)

# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 15  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, image.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
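To make the role of the connectivity graph easier to see, here is a minimal sketch on a tiny synthetic image. The 4x4 array and all `toy_*` names are made up for illustration and are not part of the original tutorial. It builds the pixel-grid adjacency matrix with `grid_to_graph` and runs the same Ward clustering with two clusters.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Hypothetical 4x4 "image": left half dark (0), right half bright (10).
toy_image = np.zeros((4, 4))
toy_image[:, 2:] = 10.0

# One row per pixel, one feature (the pixel intensity).
X_toy = toy_image.reshape(-1, 1)

# Sparse adjacency matrix of the pixel grid: each pixel is linked to its
# horizontal and vertical neighbours, not to its diagonal neighbours.
toy_connectivity = grid_to_graph(*toy_image.shape)
print(toy_connectivity.shape)  # one row and one column per pixel

toy_ward = AgglomerativeClustering(
    n_clusters=2, linkage="ward", connectivity=toy_connectivity
).fit(X_toy)
toy_labels = toy_ward.labels_.reshape(toy_image.shape)
print(toy_labels)  # the two halves should end up in different clusters
```

Because merges are only allowed along edges of the graph, every cluster is a spatially connected region of the image.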

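Feature agglomeration:
We have seen that sparsity can mitigate the curse of dimensionality, i.e. the situation where there are too few observations relative to the number of features. Another approach is to merge similar features: feature agglomeration. It is implemented by clustering in the feature direction, which amounts to clustering the transposed data.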

import numpy as np
from sklearn import cluster, datasets
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
X_approx = agglo.inverse_transform(X_reduced)
images_approx = np.reshape(X_approx, images.shape)

The transform and inverse_transform methods
Some estimators expose a transform method, used for instance to reduce the dimensionality of a dataset.
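As a rough illustration of what `transform` and `inverse_transform` do to the array shapes, the sketch below repeats the feature agglomeration on the digits dataset and prints the shape at each stage. It uses `fit_transform` as a shortcut for `fit` followed by `transform`.

```python
import numpy as np
from sklearn import cluster, datasets
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images                   # 8x8 pixel images
X = images.reshape(len(images), -1)      # one row of 64 pixels per image

agglo = cluster.FeatureAgglomeration(
    connectivity=grid_to_graph(*images[0].shape), n_clusters=32
)
X_reduced = agglo.fit_transform(X)             # 64 pixels pooled into 32 features
X_approx = agglo.inverse_transform(X_reduced)  # expanded back to 64 pixels

print(X.shape, X_reduced.shape, X_approx.shape)

# Pixels merged into the same cluster share one value after the round trip,
# so a reconstructed image holds at most 32 distinct values.
print(len(np.unique(X_approx[0])))
```

The round trip is lossy: `inverse_transform` returns an array of the original shape, but the detail inside each merged group of pixels is gone.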

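(2) Decompositions: from a signal to components and loadings

Components and loadings:
If X is our multivariate data, the problem we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = LC. Different criteria exist for choosing the components.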
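The factorization X = LC can be checked numerically. The sketch below uses PCA as one possible decomposition, with made-up random data. It assumes that L corresponds to the output of `transform` and C to `components_`, and that all components are kept. Since PCA centers the data first, the identity holds for X minus its mean.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))   # hypothetical data: 50 observations, 3 features

pca = PCA().fit(X)             # keep all 3 components
L = pca.transform(X)           # loadings: one row per observation
C = pca.components_            # components: one row per component

# PCA subtracts the feature means, so the product rebuilds X - mean.
print(np.allclose(X - pca.mean_, L @ C))
```

With fewer components than features, the product L C would only approximate the centered data.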

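Principal component analysis: PCA
Principal component analysis (PCA) selects the successive components that explain the maximum variance in the signal.

The point cloud spanned by the observations is very flat in one direction: one of the three univariate features can be computed almost exactly from the other two. PCA finds the directions in which the data is not flat.

When used to transform data, PCA can reduce the dimensionality of the data by projecting it onto a principal subspace: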

import numpy as np
from sklearn import decomposition

# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]

pca = decomposition.PCA()
pca.fit(X)
print(pca.explained_variance_)

# As we can see, only the first 2 components are useful
pca.n_components = 2
X_reduced = pca.fit_transform(X)
X_reduced.shape
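One way to turn "only the first 2 components are useful" into a number is to look at the cumulative explained variance ratio. The sketch below rebuilds the same kind of signal with a fixed seed. The 99% threshold and the `n_keep` name are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.c_[x1, x2, x1 + x2]     # the third column is redundant

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components explaining at least 99% of the variance.
n_keep = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_keep)
```

Because the third feature is an exact linear combination of the other two, the third component carries essentially no variance.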

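Independent component analysis: ICA
Independent component analysis (ICA) selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals: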

import numpy as np
from sklearn import decomposition

# Generate sample data
time = np.linspace(0, 10, 2000)
s1 = np.sin(2 * time)  # Signal 1: sinusoidal signal
s2 = np.sign(np.sin(3 * time))  # Signal 2: square signal
S = np.c_[s1, s2]
S += 0.2 * np.random.normal(size=S.shape)  # Add noise
S /= S.std(axis=0)  # Standardize data

# Mix data
A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix
X = np.dot(S, A.T)  # Generate observations

# Compute ICA
ica = decomposition.FastICA()
S_ = ica.fit_transform(X)  # Get the estimated sources
A_ = ica.mixing_.T
np.allclose(X, np.dot(S_, A_) + ica.mean_)
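ICA recovers the sources only up to order, sign and scale, so comparing the estimates with the true signals entry by entry is not meaningful. A simple check, sketched below under the assumption of a 2x2 mixing matrix `[[1, 1], [0.5, 2]]` and fixed random seeds, is to look at the absolute correlation between each true source and each estimate.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 10, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # true sources
S += 0.2 * rng.normal(size=S.shape)               # add noise
S /= S.std(axis=0)                                # standardize
A = np.array([[1, 1], [0.5, 2]])                  # assumed mixing matrix
X = S @ A.T                                       # observed mixtures

S_est = FastICA(n_components=2, random_state=0).fit_transform(X)

# Absolute correlation between every true source (rows) and every
# estimated source (columns).
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))

# Each true source should be matched closely by one of the estimates.
print(corr.max(axis=1))
```

A value near 1 in each row indicates that the corresponding source was recovered, possibly flipped or in a different column.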

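V. Putting it all together

(1) Pipelining

We have seen that some estimators (models) can transform data and that some can predict variables. We can also combine them into a single estimator: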

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# Plot the PCA spectrum
pca.fit(X_digits)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
# Parameters of pipelines can be set using '__' separated parameter names:
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()
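The `__` naming convention used in the parameter grid can also be tried directly on a pipeline, without running a grid search. The following is a minimal sketch. The values 20 and 0.1 are arbitrary and only serve to show where each setting ends up.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("pca", PCA()), ("logistic", LogisticRegression())])

# "<step name>__<parameter name>" addresses a parameter of a single step.
pipe.set_params(pca__n_components=20, logistic__C=0.1)

print(pipe.named_steps["pca"].n_components)
print(pipe.named_steps["logistic"].C)

# The same keys are listed by get_params().
print("pca__n_components" in pipe.get_params())
```

The step names given when building the pipeline ("pca" and "logistic" here) are what appears before the double underscore.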

(2) Face recognition with eigenfaces

The dataset used in this example is a preprocessed excerpt of "Labeled Faces in the Wild", better known as LFW.

http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233 MB)

(Editor: 济南站长网)
