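[OpenCV in Practice] 5 Deep-learning-based text detection

1 Loading the network

2 Reading the image

3 Forward pass

4 Processing the output

3 Results and code

3.1 Results

3.2 Code

References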

In this post we will try to find the words in an image, one word at a time. The text detection is based on a recent paper:

EAST: An Efficient and Accurate Scene Text Detector.

https://arxiv.org/abs/1704.03155v2

https://github.com/argman/EAST

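Note that text detection is not the same as text recognition. Text detection only finds the bounding boxes around text. Text recognition goes further and works out what is written inside each box. For example, in the image given below, text detection gives you the bounding box around the word, and text recognition tells you that the box contains the word STOP. This post covers text detection only.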

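This post uses a TensorFlow model and calls it through OpenCV. We will go through how the algorithm works step by step. You will need OpenCV 3.4.3 or later to run the code. Other models are read through the OpenCV DNN module in much the same way.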

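The steps involved are:

Download the EAST model
Load the model into memory
Prepare the input image
Forward-pass the blob through the network
Process the output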

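1 Loading the network

We use the cv::dnn::readNet function (C++) or cv2.dnn.readNet (Python) to load the network into memory. It detects the configuration and framework automatically from the file name given. In our case the file is a .pb file, so it is treated as a TensorFlow network. Only this one file is passed; no separate model-structure description file is used.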

C++

Net net = readNet(model);

Python

net = cv.dnn.readNet(model)
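
To make the "detects the framework from the file name" idea concrete, here is a minimal sketch of such a dispatch in plain Python. The helper `guess_framework` and its table are hypothetical illustrations, not OpenCV code; the extensions listed are the ones the OpenCV documentation mentions for readNet, and OpenCV's real logic may differ in detail.

```python
import os

# Hypothetical lookup table: file extension -> framework name.
# The extensions follow the formats documented for cv.dnn.readNet.
_FRAMEWORK_BY_EXT = {
    ".pb": "tensorflow",
    ".caffemodel": "caffe",
    ".weights": "darknet",
    ".t7": "torch",
    ".net": "torch",
    ".onnx": "onnx",
}


def guess_framework(model_path):
    """Return the framework implied by the file extension, or None if unknown."""
    ext = os.path.splitext(model_path)[1].lower()
    return _FRAMEWORK_BY_EXT.get(ext)


print(guess_framework("./model/frozen_east_text_detection.pb"))  # tensorflow
```

For the EAST model used in this post the extension is .pb, so the TensorFlow importer is chosen.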

2 Reading the image

We need to create a 4-D input blob to feed the image to the network. This is done with the blobFromImage function.

C++

blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);

Python

blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

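We need to pass several parameters to this function. In the order of the Python call (the C++ overload shown above takes the output blob as an additional second argument), they are:

The first parameter is the image itself.
The second parameter is the scale factor applied to every pixel value. No scaling is needed here, so we keep it at 1.
The third parameter is the input size. The network's default input is 320×320, so we specify it when creating the blob. It is best to match the network input.
The fourth parameter is the mean used when the model was trained. It is subtracted from the image.
The fifth parameter says whether to swap the R and B channels. This is needed because OpenCV uses BGR order while TensorFlow uses RGB order (Caffe models use BGR).
The last parameter says whether to crop the image and take the center crop. We pass False here.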
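
To show what these parameters do to the pixel data, here is a minimal pure-Python sketch of the same preprocessing. It is an illustration under assumptions, not OpenCV's implementation: the helper `to_blob` is hypothetical, it skips resizing and cropping, and it takes the image as nested lists of (b, g, r) tuples.

```python
def to_blob(image_bgr, scale=1.0, mean=(123.68, 116.78, 103.94), swap_rb=True):
    """Turn an H x W image of (b, g, r) pixels into a nested list shaped [1][3][H][W].

    Resizing and cropping are deliberately left out of this sketch.
    """
    height = len(image_bgr)
    width = len(image_bgr[0])
    # One plane per channel: channel-first layout (NCHW) instead of HWC.
    planes = [[[0.0] * width for _ in range(height)] for _ in range(3)]
    for y in range(height):
        for x in range(width):
            pixel = list(image_bgr[y][x])
            if swap_rb:
                # BGR -> RGB, so the mean is read in (R, G, B) order.
                pixel[0], pixel[2] = pixel[2], pixel[0]
            for c in range(3):
                planes[c][y][x] = (pixel[c] - mean[c]) * scale
    # Leading batch dimension of size 1.
    return [planes]


blob = to_blob([[(10, 20, 30), (0, 0, 0)]])
print(len(blob), len(blob[0]), len(blob[0][0]), len(blob[0][0][0]))  # 1 3 1 2
```

The four nested levels correspond to batch, channel, row and column, which is why the blob is called 4-D.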

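3 Forward pass

Now that the input is ready, we pass it through the network. The network has two outputs: one gives the positions of the text boxes, the other gives the confidence scores of the detected boxes. The two output layers are: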

feature_fusion/concat_3

feature_fusion/Conv_7/Sigmoid

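You can find these two outputs by opening the .pb model in Netron and looking at the final layers. Netron is a model-structure visualizer that supports many frameworks, including TensorFlow, Caffe, Keras and MXNet. Netron can be downloaded from: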

https://github.com/lutzroeder/Netron

The C++ code that names the two output layers is:

std::vector<String> outputLayers(2);
outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
outputLayers[1] = "feature_fusion/concat_3";

The Python code that names the two output layers is:

outputLayers = []
outputLayers.append("feature_fusion/Conv_7/Sigmoid")
outputLayers.append("feature_fusion/concat_3")

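Next, we get the output by passing the input blob through the network. As mentioned above, the output has two parts: confidence scores and geometry (positions).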

C++

std::vector<Mat> output;
net.setInput(blob);
net.forward(output, outputLayers);
Mat scores = output[0];
Mat geometry = output[1];

Python

net.setInput(blob)
output = net.forward(outputLayers)
scores = output[0]
geometry = output[1]
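
It helps to know what shapes to expect from scores and geometry. The sketch below is a small assumption-labeled helper (the name `east_output_shapes` is hypothetical): it relies on the facts used by the full listings in this post, namely that the feature maps are 4 times smaller than the input, that scores has 1 channel and geometry has 5, and that width and height should be multiples of 32.

```python
def east_output_shapes(inp_width, inp_height, stride=4):
    """Return the expected (scores_shape, geometry_shape) for a given input size."""
    if inp_width % 32 != 0 or inp_height % 32 != 0:
        raise ValueError("width and height should be multiples of 32")
    rows = inp_height // stride
    cols = inp_width // stride
    # scores: one confidence per cell; geometry: four distances plus one angle.
    return (1, 1, rows, cols), (1, 5, rows, cols)


print(east_output_shapes(320, 320))  # ((1, 1, 80, 80), (1, 5, 80, 80))
```

With the default 320×320 input, each of the 80×80 cells is a candidate text box, which is why a filtering step is needed afterwards.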

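4 Processing the output

As mentioned above, we use the outputs of the two layers to decode the position and orientation of each text box. This can produce many overlapping boxes, so we need to keep only the best ones. This is done with the non-maximum suppression (NMS) algorithm.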

Non-maximum suppression is widely used in object detection. For details, see:

http://www.it610.com/article/5215825.htm

https://blog.csdn.net/qq_14845119/article/details/52064928
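
For readers who prefer code to links, the greedy idea behind non-maximum suppression can be sketched in a few lines. This is a simplified illustration for axis-aligned boxes given as (x1, y1, x2, y2); the helper names `iou` and `nms` are hypothetical, and OpenCV's own functions (which also handle rotated boxes) are not implemented this way line for line.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh, nms_thresh):
    """Return indices of the boxes kept, highest score first."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= nms_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7], score_thresh=0.5, nms_thresh=0.4))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, above the 0.4 threshold, so only the higher-scoring one survives.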

1 Decoding

C++:

std::vector<RotatedRect> boxes;
std::vector<float> confidences;
decode(scores, geometry, confThreshold, boxes, confidences);

Python:

[boxes, confidences] = decode(scores, geometry, confThreshold)
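
The decode function is listed in full in the code section. To show what it computes for one cell of the feature map, here is a minimal sketch that follows the same arithmetic. The helper name `decode_cell` is hypothetical, and reading the four geometry channels as distances to the top, right, bottom and left edges of the box is my interpretation of the EAST geometry output, not something the listing states.

```python
import math


def decode_cell(x, y, d_top, d_right, d_bottom, d_left, angle, stride=4.0):
    """Decode one feature-map cell into (center, (width, height), angle_in_degrees)."""
    # Feature maps are 4 times smaller than the network input.
    offset_x, offset_y = x * stride, y * stride
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    h = d_top + d_bottom
    w = d_right + d_left
    # Reference corner of the rotated box.
    off = (offset_x + cos_a * d_right + sin_a * d_bottom,
           offset_y - sin_a * d_right + cos_a * d_bottom)
    # Two corners adjacent to the reference corner; their midpoint is the center.
    p1 = (-sin_a * h + off[0], -cos_a * h + off[1])
    p3 = (-cos_a * w + off[0], sin_a * w + off[1])
    center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
    return center, (w, h), -angle * 180.0 / math.pi


center, size, angle_deg = decode_cell(10, 5, 4.0, 6.0, 8.0, 10.0, 0.0)
print(center, size)  # (38.0, 22.0) (16.0, 12.0)
```

For an unrotated box the result is easy to check by hand: the cell maps to pixel (40, 20), the box spans 10 to the left and 6 to the right of it, 4 above and 8 below, so its center is (38, 22).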

2 Non-maximum suppression

We use the OpenCV function NMSBoxes (C++) or NMSBoxesRotated (Python) to filter out false positives and obtain the final predictions.

C++:

std::vector<int> indices;
NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

Python:

indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
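
One practical detail in Python: the layout of the returned indices depends on the OpenCV version. Older releases return an N×1 array (so each element is read as i[0]), while newer releases, to my knowledge from 4.5.4 on, return a flat array, and an empty result can come back as an empty tuple. Treat the exact version boundary as an assumption. The hypothetical helper below accepts all of these forms.

```python
def flatten_indices(indices):
    """Normalize NMS indices to a plain list of ints, whatever their nesting."""
    flat = []
    for item in indices:
        try:
            # N x 1 layout: each item is a one-element sequence.
            flat.append(int(item[0]))
        except (TypeError, IndexError):
            # Flat layout: each item is already a scalar.
            flat.append(int(item))
    return flat


print(flatten_indices([[2], [0]]))  # [2, 0]
print(flatten_indices([2, 0]))      # [2, 0]
```

Looping over flatten_indices(indices) and indexing boxes with the plain integer avoids the "invalid index to scalar variable" error that i[0] raises on a flat array.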

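3 Results and code

3.1 Results

I ran the C++ code under VS2017. In my tests the OpenCV version had to be at least 3.4.5; otherwise reading the model failed. The model file is too large to include here, so see the download links: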

https://download.csdn.net/download/luohenyj/11003000

https://github.com/luohenyueji/OpenCV-Practical-Exercise

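If you have no CSDN download points (the site sets the resource price automatically), see the reference link. I ported this post from there without major changes.

Alternatively, if you can reach Dropbox (a proxy may be needed), download the model directly: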

https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1

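The results are shown below. The detection quality is fairly good and the speed is acceptable.

3.2 Code

The C++ code has been modified somewhat; the Python code has not. I am not very familiar with text detection, so there are few comments, but the code should not need major changes in practice.

C++ code: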

// text_detection.cpp : This file contains the "main" function. Program execution begins and ends there.
//
#include "pch.h" // Visual Studio precompiled header; remove it if your project does not use one
#include <iostream>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace cv;
using namespace cv::dnn;

// Decode the raw network outputs into rotated boxes and confidences
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
	std::vector<RotatedRect> &detections, std::vector<float> &confidences);

/**
 * @brief Detect text in an image and draw the detected boxes on a copy of it
 *
 * @param srcImg image to run detection on
 * @param inpWidth network input width
 * @param inpHeight network input height
 * @param confThreshold confidence threshold
 * @param nmsThreshold non-maximum suppression threshold
 * @param net loaded EAST network
 * @return Mat copy of the input image with boxes and timing drawn on it
 */
Mat text_detect(Mat srcImg, int inpWidth, int inpHeight, float confThreshold, float nmsThreshold, Net net)
{
	// Output layers
	std::vector<Mat> output;
	std::vector<String> outputLayers(2);
	outputLayers[0] = "feature_fusion/Conv_7/Sigmoid";
	outputLayers[1] = "feature_fusion/concat_3";

	// Image to run detection on
	Mat frame, blob;
	frame = srcImg.clone();

	// Build the network input
	blobFromImage(frame, blob, 1.0, Size(inpWidth, inpHeight), Scalar(123.68, 116.78, 103.94), true, false);
	net.setInput(blob);

	// Run the forward pass
	net.forward(output, outputLayers);
	// Confidence scores
	Mat scores = output[0];
	// Geometry (position parameters)
	Mat geometry = output[1];

	// Decode predicted bounding boxes to get the position and orientation of each text box
	// Text box geometry
	std::vector<RotatedRect> boxes;
	// Text box confidences
	std::vector<float> confidences;
	decode(scores, geometry, confThreshold, boxes, confidences);

	// Apply the non-maximum suppression procedure
	// Indices of the boxes that are kept
	std::vector<int> indices;
	NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);

	// Render detections
	// Scale factors from network input size back to the original image size
	Point2f ratio((float)frame.cols / inpWidth, (float)frame.rows / inpHeight);
	for (size_t i = 0; i < indices.size(); ++i)
	{
		RotatedRect &box = boxes[indices[i]];
		Point2f vertices[4];
		box.points(vertices);
		// Map the corner points back to original image coordinates
		for (int j = 0; j < 4; ++j)
		{
			vertices[j].x *= ratio.x;
			vertices[j].y *= ratio.y;
		}
		// Draw the box
		for (int j = 0; j < 4; ++j)
		{
			line(frame, vertices[j], vertices[(j + 1) % 4], Scalar(0, 255, 0), 2, LINE_AA);
		}
	}

	// Put efficiency information (inference time in milliseconds)
	std::vector<double> layersTimes;
	double freq = getTickFrequency() / 1000;
	double t = net.getPerfProfile(layersTimes) / freq;
	std::string label = format("Inference time: %.2f ms", t);
	putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

	return frame;
}

// Model path
auto model = "./model/frozen_east_text_detection.pb";
// Image to run detection on
auto detect_image = "./image/patient.jpg";
// Network input size
auto inpWidth = 320;
auto inpHeight = 320;
// Confidence threshold
auto confThreshold = 0.5f;
// Non-maximum suppression threshold
auto nmsThreshold = 0.4f;

int main()
{
	// Load the model
	Net net = readNet(model);
	// Read the image; stop early if it cannot be read
	Mat srcImg = imread(detect_image);
	if (srcImg.empty())
	{
		cout << "failed to read image: " << detect_image << endl;
		return -1;
	}
	cout << "read image success!" << endl;

	Mat resultImg = text_detect(srcImg, inpWidth, inpHeight, confThreshold, nmsThreshold, net);
	imshow("result", resultImg);
	waitKey();
	return 0;
}

/**
 * @brief Extract the detected text boxes and their confidences from the network outputs
 *
 * @param scores confidence map
 * @param geometry geometry map (four distances and an angle per cell)
 * @param scoreThresh confidence threshold
 * @param detections output rotated boxes
 * @param confidences output confidence of each box
 */
void decode(const Mat &scores, const Mat &geometry, float scoreThresh,
	std::vector<RotatedRect> &detections, std::vector<float> &confidences)
{
	detections.clear();
	confidences.clear();

	// Check that the outputs have the expected dimensions
	CV_Assert(scores.dims == 4);
	CV_Assert(geometry.dims == 4);
	CV_Assert(scores.size[0] == 1);
	CV_Assert(geometry.size[0] == 1);
	CV_Assert(scores.size[1] == 1);
	CV_Assert(geometry.size[1] == 5);
	CV_Assert(scores.size[2] == geometry.size[2]);
	CV_Assert(scores.size[3] == geometry.size[3]);

	const int height = scores.size[2];
	const int width = scores.size[3];
	for (int y = 0; y < height; ++y)
	{
		// Confidence scores for this row
		const float *scoresData = scores.ptr<float>(0, 0, y);
		// Text box distances for this row
		const float *x0_data = geometry.ptr<float>(0, 0, y);
		const float *x1_data = geometry.ptr<float>(0, 1, y);
		const float *x2_data = geometry.ptr<float>(0, 2, y);
		const float *x3_data = geometry.ptr<float>(0, 3, y);
		// Text box angles for this row
		const float *anglesData = geometry.ptr<float>(0, 4, y);
		// Visit every candidate box in the row
		for (int x = 0; x < width; ++x)
		{
			float score = scoresData[x];
			// Skip candidates below the threshold
			if (score < scoreThresh)
			{
				continue;
			}
			// Decode a prediction.
			// Multiply by 4 because the feature maps are 4 times smaller than the input image.
			float offsetX = x * 4.0f, offsetY = y * 4.0f;
			// Angle and its sine and cosine
			float angle = anglesData[x];
			float cosA = std::cos(angle);
			float sinA = std::sin(angle);
			float h = x0_data[x] + x2_data[x];
			float w = x1_data[x] + x3_data[x];

			Point2f offset(offsetX + cosA * x1_data[x] + sinA * x2_data[x],
				offsetY - sinA * x1_data[x] + cosA * x2_data[x]);
			Point2f p1 = Point2f(-sinA * h, -cosA * h) + offset;
			Point2f p3 = Point2f(-cosA * w, sinA * w) + offset;
			// Rotated rectangle from center point, box size (width, height) and angle in degrees
			RotatedRect r(0.5f * (p1 + p3), Size2f(w, h), -angle * 180.0f / (float)CV_PI);
			// Store the box
			detections.push_back(r);
			// Store the confidence of the box
			confidences.push_back(score);
		}
	}
}

Python code:

# Import required modules
import cv2 as cv
import math
import argparse

parser = argparse.ArgumentParser(description='Use this script to run text detection deep learning networks using OpenCV.')
# Input argument
parser.add_argument('--input', help='Path to the input image. If omitted, the sample image path in the script is used.')
# Model argument
parser.add_argument('--model', default="./model/frozen_east_text_detection.pb",
                    help='Path to a binary .pb file of the model containing trained weights.'
                    )
# Width argument
parser.add_argument('--width', type=int, default=320,
                    help='Preprocess input image by resizing to a specific width. It should be a multiple of 32.'
                   )
# Height argument
parser.add_argument('--height', type=int, default=320,
                    help='Preprocess input image by resizing to a specific height. It should be a multiple of 32.'
                   )
# Confidence threshold
parser.add_argument('--thr', type=float, default=0.5,
                    help='Confidence threshold.'
                   )
# Non-maximum suppression threshold
parser.add_argument('--nms', type=float, default=0.4,
                    help='Non-maximum suppression threshold.'
                   )
args = parser.parse_args()

############ Utility functions ############
def decode(scores, geometry, scoreThresh):
    detections = []
    confidences = []

    ############ CHECK DIMENSIONS AND SHAPES OF geometry AND scores ############
    assert len(scores.shape) == 4, "Incorrect dimensions of scores"
    assert len(geometry.shape) == 4, "Incorrect dimensions of geometry"
    assert scores.shape[0] == 1, "Invalid dimensions of scores"
    assert geometry.shape[0] == 1, "Invalid dimensions of geometry"
    assert scores.shape[1] == 1, "Invalid dimensions of scores"
    assert geometry.shape[1] == 5, "Invalid dimensions of geometry"
    assert scores.shape[2] == geometry.shape[2], "Invalid dimensions of scores and geometry"
    assert scores.shape[3] == geometry.shape[3], "Invalid dimensions of scores and geometry"
    height = scores.shape[2]
    width = scores.shape[3]
    for y in range(0, height):
        # Extract data from scores and geometry for this row
        scoresData = scores[0][0][y]
        x0_data = geometry[0][0][y]
        x1_data = geometry[0][1][y]
        x2_data = geometry[0][2][y]
        x3_data = geometry[0][3][y]
        anglesData = geometry[0][4][y]
        for x in range(0, width):
            score = scoresData[x]

            # If score is lower than threshold score, move to next x
            if score < scoreThresh:
                continue

            # Calculate offset (feature maps are 4 times smaller than the input)
            offsetX = x * 4.0
            offsetY = y * 4.0
            angle = anglesData[x]

            # Calculate cos and sin of angle
            cosA = math.cos(angle)
            sinA = math.sin(angle)
            h = x0_data[x] + x2_data[x]
            w = x1_data[x] + x3_data[x]

            # Calculate the reference corner of the box
            offset = ([offsetX + cosA * x1_data[x] + sinA * x2_data[x], offsetY - sinA * x1_data[x] + cosA * x2_data[x]])

            # Find points for rectangle
            p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
            p3 = (-cosA * w + offset[0],  sinA * w + offset[1])
            center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
            detections.append((center, (w, h), -1 * angle * 180.0 / math.pi))
            confidences.append(float(score))

    # Return detections and confidences
    return [detections, confidences]

if __name__ == "__main__":
    # Read and store arguments
    confThreshold = args.thr
    nmsThreshold = args.nms
    inpWidth = args.width
    inpHeight = args.height
    model = args.model

    # Load network
    net = cv.dnn.readNet(model)

    # Window title
    kWinName = "EAST: An Efficient and Accurate Scene Text Detector"
    outputLayers = []
    outputLayers.append("feature_fusion/Conv_7/Sigmoid")
    outputLayers.append("feature_fusion/concat_3")

    # Read frame (use --input if given, otherwise the sample image)
    frame = cv.imread(args.input if args.input else "./image/stop1.jpg")
    if frame is None:
        raise SystemExit("Could not read the input image")

    # Get frame height and width
    height_ = frame.shape[0]
    width_ = frame.shape[1]
    rW = width_ / float(inpWidth)
    rH = height_ / float(inpHeight)

    # Create a 4D blob from frame.
    blob = cv.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)

    # Run the model
    net.setInput(blob)
    output = net.forward(outputLayers)
    t, _ = net.getPerfProfile()
    label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())

    # Get scores and geometry
    scores = output[0]
    geometry = output[1]
    [boxes, confidences] = decode(scores, geometry, confThreshold)

    # Apply NMS
    indices = cv.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
    for i in indices:
        # Depending on the OpenCV version, each index is either a one-element array or a scalar
        idx = int(i[0]) if hasattr(i, "__len__") else int(i)
        # get 4 corners of the rotated rect
        vertices = cv.boxPoints(boxes[idx])
        # scale the bounding box coordinates based on the respective ratios
        for j in range(4):
            vertices[j][0] *= rW
            vertices[j][1] *= rH
        for j in range(4):
            # cv.line expects integer pixel coordinates
            p1 = (int(vertices[j][0]), int(vertices[j][1]))
            p2 = (int(vertices[(j + 1) % 4][0]), int(vertices[(j + 1) % 4][1]))
            cv.line(frame, p1, p2, (0, 255, 0), 2, cv.LINE_AA)
            # cv.putText(frame, "{:.3f}".format(confidences[idx]), (int(vertices[0][0]), int(vertices[0][1])), cv.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1, cv.LINE_AA)

    # Put efficiency information
    cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))

    # Display the frame
    cv.imshow(kWinName, frame)
    cv.waitKey(0)

References

https://www.learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/
