Python Data Analysis Basics: A Summary

I. Data Reading

1. Reading and writing database data

Read functions:
pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, columns=None)
pandas.read_sql_query(sql, con, index_col=None, coerce_float=True)
pandas.read_sql(sql, con, index_col=None, coerce_float=True, columns=None)

Connection (SQLAlchemy):
sqlalchemy.create_engine('dialect+driver://username:password@host:port/database?charset=encoding')

Write function:
DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, dtype=None)

2. Reading and writing text/CSV data

Read functions:
pandas.read_table(filepath_or_buffer, sep='\t', header='infer', names=None, index_col=None, dtype=None, engine=None, nrows=None)
pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, dtype=None, engine=None, nrows=None)

Write function:
DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', columns=None, header=True, index=True, index_label=None, mode='w', encoding=None)

3. Reading and writing Excel (xls/xlsx) data

Read function:
pandas.read_excel(io, sheet_name=0, header=0, index_col=None, names=None, dtype=None)

Write function:
DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', header=True, index=True, index_label=None)

4. Reading clipboard data:
pandas.read_clipboard()

II. Data Preprocessing

1. Data cleaning

Duplicate handling

Duplicate samples:
DataFrame/Series.drop_duplicates(subset=None, keep='first', inplace=False)

Duplicate features (generic approach):

    import pandas as pd

    def FeatureEquals(df):
        # Pairwise-compare columns; True marks a pair of identical features
        dfEquals = pd.DataFrame([], columns=df.columns, index=df.columns)
        for i in df.columns:
            for j in df.columns:
                dfEquals.loc[i, j] = df.loc[:, i].equals(df.loc[:, j])
        return dfEquals

Numeric features:

    def drop_features(data, way='pearson', assoRate=1.0):
        '''
        Return one column of each pair whose correlation is >= assoRate,
        mainly to remove duplicated numeric features.
        data: DataFrame, no default
        assoRate: similarity threshold, default 1.0
        '''
        assoMat = data.corr(method=way)
        delCol = []
        length = len(assoMat)
        for i in range(length):
            for j in range(i + 1, length):
                if assoMat.iloc[i, j] >= assoRate:
                    delCol.append(assoMat.columns[j])
        return delCol

Missing-value handling

Detecting missing values:
DataFrame.isnull()
DataFrame.notnull()
DataFrame.isna()
DataFrame.notna()

Handling missing values:
Drop: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Fill with a value: DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
Interpolate: DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)

Outlier handling

3-sigma rule:

    import numpy as np

    def outRange(Ser1):
        # Flag values more than 3 standard deviations away from the mean
        boolInd = (Ser1.mean() - 3 * Ser1.std() > Ser1) | (Ser1.mean() + 3 * Ser1.std() < Ser1)
        index = np.arange(Ser1.shape[0])[boolInd]
        outrange = Ser1.iloc[index]
        return outrange

Note: this method only applies to normally distributed data.

Box-plot analysis:

    def boxOutRange(Ser):
        '''
        Ser: the DataFrame column to analyze for outliers
        '''
        IQR = Ser.quantile(0.75) - Ser.quantile(0.25)
        Low = Ser.quantile(0.25) - 1.5 * IQR
        Up = Ser.quantile(0.75) + 1.5 * IQR
        index = (Ser < Low) | (Ser > Up)
        Outlier = Ser.loc[index]
        return Outlier

2. Merging data

Stacking: pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
Key-based merge: pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
Overlap combine: pandas.DataFrame.combine_first(self, other)

3. Data transformation

Dummy variables: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
Discretization: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)

4. Data standardization

Z-score standardization: sklearn.preprocessing.StandardScaler
Min-max normalization: sklearn.preprocessing.MinMaxScaler

III. Model Building

1. Train/test split
sklearn.model_selection.train_test_split(*arrays, **options)
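To make the reading, cleaning, scaling, and splitting steps above concrete, here is a minimal sketch that chains them together; the file name data.csv and the target column name label are placeholder assumptions, not part of the original summary.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split

    # 'data.csv' and the 'label' column are hypothetical placeholders
    df = pd.read_csv('data.csv')
    df = df.drop_duplicates()            # drop duplicated samples
    df = df.fillna(df.mean())            # fill numeric gaps with column means
    X = df.drop(columns=['label'])       # feature matrix
    y = df['label']                      # target vector

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    scaler = MinMaxScaler().fit(X_train)     # fit the scaler on the training set only
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

Fitting the scaler on the training set alone mirrors the demo in subsection III.8 below and avoids leaking test-set statistics into preprocessing.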
2. Dimensionality reduction
class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)

3. Cross-validation
sklearn.model_selection.cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score='warn')

4. Model training and prediction (supervised models)

    clf = lr.fit(X_train, y_train)   # lr is any instantiated estimator
    clf.predict(X_test)

5. Clustering

Common algorithms:
K-means: class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
DBSCAN density clustering: class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=1)
Birch hierarchical clustering: class sklearn.cluster.Birch(threshold=0.5, branching_factor=50, n_clusters=3, compute_labels=True, copy=True)

Evaluation:
Silhouette coefficient: sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds)
calinski_harabaz_score: sklearn.metrics.calinski_harabaz_score(X, labels)
completeness_score: sklearn.metrics.completeness_score(labels_true, labels_pred)
fowlkes_mallows_score: sklearn.metrics.fowlkes_mallows_score(labels_true, labels_pred, sparse=False)
homogeneity_completeness_v_measure: sklearn.metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
adjusted_rand_score: sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)
homogeneity_score: sklearn.metrics.homogeneity_score(labels_true, labels_pred)
mutual_info_score: sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None)
normalized_mutual_info_score: sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred)
v_measure_score: sklearn.metrics.v_measure_score(labels_true, labels_pred)
Note: every metric that takes a labels_true parameter requires ground-truth labels.
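As a hedged sketch of how the clustering estimators and metrics above fit together, the snippet below clusters the built-in iris data with KMeans and scores the result with one internal metric (no ground truth needed) and one external metric (needs labels_true); the choice of iris and of three clusters is only for illustration.

    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    X, y = datasets.load_iris(return_X_y=True)
    km = KMeans(n_clusters=3, random_state=0).fit(X)    # fit three clusters
    print(silhouette_score(X, km.labels_))              # internal metric: uses X and labels only
    print(adjusted_rand_score(y, km.labels_))           # external metric: requires labels_true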
6. Classification

Common algorithms:
AdaBoost classifier: class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
Gradient boosting classifier: class sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
Random forest classifier: class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
Gaussian process classifier: class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1)
Logistic regression: class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
KNN classifier: class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
Multi-layer perceptron classifier: class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
SVM classifier: class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
Decision tree classifier: class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

Evaluation:
Accuracy: sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
AUC: sklearn.metrics.auc(x, y, reorder=False)
Classification report: sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)
Confusion matrix: sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
Kappa: sklearn.metrics.cohen_kappa_score(y1, y2, labels=None, weights=None, sample_weight=None)
F1 score: sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
Precision: sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
Recall: sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
ROC curve: sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
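A short sketch of the fit/predict/evaluate pattern using the classifiers and metrics listed above; picking DecisionTreeClassifier and the default split ratio is an arbitrary illustrative choice, not a recommendation from the original text.

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, classification_report

    X, y = datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(accuracy_score(y_test, y_pred))          # overall accuracy
    print(classification_report(y_test, y_pred))   # per-class precision, recall, F1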
7. Regression

Common algorithms:
AdaBoost regressor: class sklearn.ensemble.AdaBoostRegressor(base_estimator=None, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)
Gradient boosting regressor: class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
Random forest regressor: class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)
Gaussian process regressor: class sklearn.gaussian_process.GaussianProcessRegressor(kernel=None, alpha=1e-10, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, normalize_y=False, copy_X_train=True, random_state=None)
Isotonic regression: class sklearn.isotonic.IsotonicRegression(y_min=None, y_max=None, increasing=True, out_of_bounds='nan')
Lasso regression: class sklearn.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
Linear regression: class sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
Ridge regression: class sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
KNN regressor: class sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
Multi-layer perceptron regressor: class sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
SVM regressor: class sklearn.svm.SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
Decision tree regressor: class sklearn.tree.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)

Evaluation:
Explained variance: sklearn.metrics.explained_variance_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
Mean absolute error: sklearn.metrics.mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
Mean squared error: sklearn.metrics.mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
Mean squared log error: sklearn.metrics.mean_squared_log_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
Median absolute error: sklearn.metrics.median_absolute_error(y_true, y_pred)
R² score: sklearn.metrics.r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')

8. Demo

    from sklearn import neighbors, datasets, preprocessing
    from sklearn.model_selection import train_test_split   # train_test_split lives in sklearn.model_selection
    from sklearn.metrics import accuracy_score

    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

    # Standardize features using statistics from the training set only
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy_score(y_test, y_pred)

IV. Plotting

1. Creating a canvas or subplots

plt.figure: create a blank canvas; figure size and DPI can be specified.
figure.add_subplot: create and select a subplot; the number of rows, columns, and the index of the selected subplot can be specified.

2. Drawing

plt.title: add a title to the current figure; name, position, color, font size, etc. can be specified.
plt.xlabel: add an x-axis label to the current figure; position, color, font size, etc. can be specified.
plt.ylabel: add a y-axis label to the current figure; position, color, font size, etc. can be specified.
plt.xlim: set the x-axis range of the current figure; only a numeric interval is accepted, not string labels.
plt.ylim: set the y-axis range of the current figure; only a numeric interval is accepted, not string labels.
plt.xticks: set the number and values of the x-axis ticks.
plt.yticks: set the number and values of the y-axis ticks.
plt.legend: add a legend to the current figure; its size, position, and labels can be specified.

3. Displaying Chinese text

    plt.rcParams['font.sans-serif'] = 'SimHei'    # use the SimHei font so Chinese characters display
    plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly

4. Plot types

Scatter plot: matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, ...)
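To show how the plotting calls above combine in practice, here is a minimal sketch with random data; the figure size, color, and labels are illustrative assumptions, and the SimHei lines only take effect if that font is installed.

    import numpy as np
    import matplotlib.pyplot as plt

    plt.rcParams['font.sans-serif'] = 'SimHei'   # render Chinese characters (requires the SimHei font)
    plt.rcParams['axes.unicode_minus'] = False   # render the minus sign correctly

    x = np.random.rand(50)
    y = np.random.rand(50)

    fig = plt.figure(figsize=(6, 4))             # blank canvas
    fig.add_subplot(1, 1, 1)                     # 1 row, 1 column, first subplot
    plt.scatter(x, y, c='steelblue', alpha=0.7)  # scatter plot
    plt.title('scatter demo')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()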