How to extract hardcoded subtitles from video

MrNice · June 28, 2023, 7:53am

Hello there,
I am looking for users experienced with hardcoded subtitles (hardsubs).
I need to extract hardcoded French subtitles from an MP4 video.
I found a few tools, but how to use them is not very clear to me.
VideoSubFinder
video-subtitle-extractor
https://github.com/oliverfei/videocr-PaddleOCR
Maybe some others
Which one is the easiest and best to use with Leap 15.5?
Any advice or how-to is welcome.
Let me know if you need more info.
Many thanks

dcurtisfra · June 28, 2023, 10:02am

May I suggest that you take a look at the VLC media player? There is a plugin named “VLsub”, which seems to be installed by default: simply open the “View” menu and enable “VLsub”.

MrNice · June 28, 2023, 10:27am

Thank you for your suggestion.
I tried but found nothing. Look
here
However, the subtitles I want are not on any of the subtitle sites I tried, such as
subscene, opensubtitles and more.
The subtitles are a French translation burned into a low-resolution copy of a Russian serial. I have a better-resolution original-language version (VO) without subtitles.
So I want to extract the subtitles from the low-resolution copy and add them to the other one.

dcurtisfra · June 28, 2023, 1:17pm

Then, you could take a look at the video file with FFmpeg – it has a function which may be sufficient for what you’re trying to achieve: <https://trac.ffmpeg.org/wiki/ExtractSubtitles>.

MrNice · June 28, 2023, 1:39pm

Not at all.
ffmpeg can only extract subtitle tracks, not hardcoded subtitles.
I need an OCR tool that can “read” the text in the image and convert it to real text.
Have a look at the links I provided first.
I am looking for users with experience installing one of these programs; unlike most software, they are not in the repos.

dcurtisfra · June 28, 2023, 2:54pm

Yes, I looked at those packages.

Please check whether they’re available as Flatpaks.
The reason is that such tools are better installed in a specific user’s space than system-wide …

Did you take a look at “CCExtractor”?
<https://github.com/CCExtractor/ccextractor/releases>
<https://ccextractor.org/>

But, you’ll have to build it …
An RPM package isn’t available off the shelf …

MrNice · June 28, 2023, 3:26pm

Before posting, I searched Flathub: there is no subtitle extractor, only OCR tools.

I just had a look.
Unfortunately, CCExtractor is not a hardcoded sub extractor.

FYI, a hardcoded subtitle extractor works like this: it captures an image of the subtitle from the frame (image processing), records its start and stop times in the video, runs the image through OCR, checks the output words against a dictionary, then writes the text with the times to a file in the right format, usually an .srt file.
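
To make the last step concrete, here is a minimal Python sketch of writing timed text in the .srt layout (a counter, a `start --> end` line with `HH:MM:SS,mmm` timestamps, the text, then a blank line). The function names `srt_timestamp` and `to_srt` and the sample cues are illustrative assumptions, not part of any of the tools mentioned above.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        # Each cue: running number, time range, subtitle text.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by one blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical cues standing in for OCR results.
    sample = [(1.0, 2.5, "Bonjour"), (3.0, 4.2, "Au revoir")]
    print(to_srt(sample))
```

The OCR and dictionary-check steps would sit in front of this: whatever tool produces the text and times, the output only has to be turned into tuples like the ones in `sample`.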

hendersj · June 28, 2023, 4:44pm

Perhaps it would help others help you if you posted the information from:

ffmpeg -i <filename>

MrNice · June 28, 2023, 4:51pm

I think you mean

ffprobe -i filename

So,

> ffprobe -i "17 moments du printemps 1_12.mp4"
ffprobe version 4.4.4 Copyright (c) 2007-2023 the FFmpeg developers
  built with gcc 7 (SUSE Linux)
  configuration: --prefix=/usr --libdir=/usr/lib64 --shlibdir=/usr/lib64 --incdir=/usr/include/ffmpeg --extra-cflags='-fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g' --optflags='-fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g' --disable-htmlpages --enable-pic --disable-stripping --enable-shared --disable-static --enable-gpl --enable-version3 --disable-openssl --enable-avresample --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcelt --enable-libcdio --enable-libdav1d --enable-libdc1394 --enable-libdrm --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libv4l2 --enable-libvpx --enable-libwebp --enable-libxml2 --enable-libzimg --enable-libzvbi --enable-libmfx --enable-vaapi --enable-vdpau --enable-version3 --enable-libfdk-aac-dlopen --enable-nonfree --enable-libvo-amrwbenc --enable-libx264 --enable-libx265 --enable-librtmp --enable-libxvid
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '17 moments du printemps 1_12.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.45.100
  Duration: 01:08:42.61, start: 0.000000, bitrate: 702 kb/s
  Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 640x480 [SAR 1:1 DAR 4:3], 602 kb/s, 25 fps, 25 tbr, 90k tbn, 50 tbc (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
  Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 93 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]

Stream #0:0: Video
Stream #0:1: Audio
No subtitle stream as it is hardcoded.

hui · June 28, 2023, 4:55pm

Here is a good description of how to use it (don’t click on any download button on this page…):

No need to compile or build anything. Just make the *.run file executable and see whether it works with Leap 15.5.

hendersj · June 28, 2023, 5:09pm

ffmpeg should have worked for that as well, but either way, this info will help those trying to help you understand the file you’re working with.

As you say, it does seem that you need something that will handle OCR on the burned-in subtitles - I’m not aware of anything that does that, but maybe someone else will have an idea now that you’ve got the specs spelled out.

MrNice · June 28, 2023, 6:25pm

I downloaded it, unzipped it and changed into the directory.

/VideoSubFinder> ls
bitmaps               libnppicc.so.12              libopencv_imgproc.so.407    libtbb.so.2
Docs                  libnppig.so.12               libopencv_ml.so.407         libvidstab.so.1.1
finished.wav          libopencv_calib3d.so.407     libopencv_objdetect.so.407  libwx_baseu-3.2.so.0
libavcodec.so.58.134  libopencv_core.so.407        libopencv_photo.so.407      libwx_gtk3u_aui-3.2.so.0
libavfilter.so.7.110  libopencv_dnn.so.407         libopencv_stitching.so.407  libwx_gtk3u_core-3.2.so.0
libavformat.so.58.76  libopencv_features2d.so.407  libopencv_videoio.so.407    settings
libavresample.so.4.0  libopencv_flann.so.407       libopencv_video.so.407      VideoSubFinderWXW
libavutil.so.56.70    libopencv_gapi.so.407        libpostproc.so.55.9         VideoSubFinderWXW.run
libcudart.so.12       libopencv_highgui.so.407     libswresample.so.3.9
libnppc.so.12         libopencv_imgcodecs.so.407   libswscale.so.5.9

VideoSubFinderWXW.run can be run.

 ./VideoSubFinderWXW.run
./VideoSubFinderWXW: error while loading shared libraries: libjpeg.so.62: cannot open shared object file: No such file or directory

sudo zypper se libjpeg
Loading repository data...
Reading installed packages...

S | Name                  | Summary                                                               | Type
--+-----------------------+-----------------------------------------------------------------------+--------
  | libjpeg-turbo         | A SIMD-accelerated library for manipulating JPEG image files          | package
i | libjpeg8              | A SIMD-accelerated JPEG compression/decompression library             | package
  | libjpeg8-32bit        | A SIMD-accelerated JPEG compression/decompression library             | package
  | libjpeg8-devel        | Development Tools for applications which will use the Libjpeg Library | package
  | libjpeg8-devel-32bit  | Development Tools for applications which will use the Libjpeg Library | package
  | libjpeg62             | A SIMD-accelerated JPEG compression/decompression library             | package
  | libjpeg62-32bit       | A SIMD-accelerated JPEG compression/decompression library             | package
  | libjpeg62-devel       | Development Tools for applications which will use the Libjpeg Library | package
  | libjpeg62-devel-32bit | Development Tools for applications which will use the Libjpeg Library | package
  | libjpegxr0            | Open source implementation of jpegxr                                  | package

I can’t find which package provides libjpeg.so.62.
Any idea?

MrNice · June 28, 2023, 6:25pm

Thanks anyway, hendersj

hendersj · June 28, 2023, 7:01pm

It does strike me that you may need to do something like crop the video to the subtitle area - otherwise your OCR software may pick up non-subtitle text in the images and include that in the output.
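
As a rough sketch of that cropping idea: ffmpeg’s `crop` filter takes `width:height:x:y`, so a bottom strip of the 640x480 frame from the ffprobe output above could be cut out as below. The 25% strip height, the helper name `crop_command` and the output filename are assumptions for illustration; the real subtitle area has to be measured on the actual video.

```python
import shlex


def crop_command(src, dst, width, height, strip_fraction=0.25):
    """Build an ffmpeg command that keeps only the bottom strip of the frame.

    strip_fraction is an assumed share of the frame height; adjust it to
    where the subtitles actually sit.
    """
    strip_h = int(height * strip_fraction)
    strip_h -= strip_h % 2          # keep the height even for yuv420p video
    y = height - strip_h            # top edge of the strip
    return [
        "ffmpeg", "-i", src,
        "-vf", f"crop={width}:{strip_h}:0:{y}",
        "-an",                      # audio is not needed for OCR
        dst,
    ]


if __name__ == "__main__":
    cmd = crop_command("17 moments du printemps 1_12.mp4", "subs_only.mp4", 640, 480)
    # Print the command instead of running it, so it can be checked first.
    print(shlex.join(cmd))
```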

Are the subtitles on a black background, or just superimposed over the image with a transparent background? (That will also make a big difference to the software’s ability to extract the subtitle text.)

hendersj · June 28, 2023, 7:04pm

libjpeg62 is the package that should include that library.

MrNice · June 29, 2023, 4:20pm

After installing the libjpeg62 package, it is working fine.
I followed the howto and got around 850 .jpeg files.
I had to check all of them, because high-contrast pictures without subtitles are sometimes kept as well.
Around 530 files are left now.
Then I installed tesseract plus the French language data.
Fast and easy. As the subtitles have no background, the picture sometimes mixes with the text. After the “create sub” step, I get a subtitle file. Now I have to correct the bad text.
A bit of work, but not too much.
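
For anyone who wants to run the OCR step by hand on the kept images, a minimal Python sketch follows. It assumes the `tesseract` command-line tool and its French data (`fra`) are installed; the helper names, the `--psm 6` page segmentation mode (treat the image as one block of text) and the `images` directory are assumptions, not what VideoSubFinder itself does internally.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, lang="fra", psm=6):
    """Build a tesseract call that prints the recognised text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image, lang="fra"):
    """Run tesseract on one image and return the text on a single line."""
    result = subprocess.run(
        tesseract_command(image, lang),
        capture_output=True, text=True, check=True,
    )
    # Collapse line breaks and repeated spaces from the OCR output.
    return " ".join(result.stdout.split())


if __name__ == "__main__":
    # "images" is a hypothetical directory holding the kept .jpeg files.
    for image in sorted(Path("images").glob("*.jpeg")):
        print(image.name, "->", ocr_image(image))
```

The recognised text still needs proofreading, especially where the picture shows through behind the letters.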

It was the first time I did that; not too bad!
Many thanks for your help.

arvidjaar · June 29, 2023, 4:43pm

zypper search --provides libjpeg.so.62