Make WordPress Core

Opened 8 years ago

Last modified 8 weeks ago

#38044 new defect (bug)

Make seems_utf8() RFC 3629 compliant.

Reported by: gitlost Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 1.2.1
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

seems_utf8() should be made RFC 3629 compliant. Currently it accepts overlong sequences and surrogates, which will cause PHP functions expecting valid UTF-8 strings to fail.
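
To make the two classes of input concrete, here is a minimal sketch in Python (used only for illustration; this is not WordPress code, and the helper name `strictly_valid_utf8` is hypothetical). It relies on Python's built-in strict UTF-8 decoder, which rejects both overlong forms and encoded surrogates.

```python
def strictly_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Overlong form: "/" (U+002F) fits in one byte, but is spread over two here.
OVERLONG_SLASH = b"\xc0\xaf"

# Encoded surrogate: U+D800 written as a 3-byte sequence.
ENCODED_SURROGATE = b"\xed\xa0\x80"

# Well-formed 3-byte sequence for comparison: the euro sign (U+20AC).
VALID_EURO = b"\xe2\x82\xac"

for label, sample in [
    ("overlong slash", OVERLONG_SLASH),
    ("encoded surrogate", ENCODED_SURROGATE),
    ("euro sign", VALID_EURO),
]:
    print(label, strictly_valid_utf8(sample))
```

A validator that returns true for the first two samples would hand those bytes on to functions that expect well-formed UTF-8.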

Change History (3)

#1 @desrosj
6 years ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release

Related: #29717.

This ticket was mentioned in PR #7463 on WordPress/wordpress-develop by @debarghyabanerjee.


2 months ago
#2

  • Keywords has-patch added; needs-patch removed

Trac Ticket: Core-38044

## Overview

  • This pull request updates the seems_utf8() function so that it validates UTF-8 encoded strings according to RFC 3629. Only well-formed UTF-8 sequences are accepted, so invalid input is caught before it reaches code that expects valid UTF-8.

## Key Features

  • UTF-8 Encoding Compliance:
    • The function follows the UTF-8 encoding rules defined in RFC 3629, which allow at most 4 bytes per character.
  • Handling of Single and Multi-byte Sequences:
    • It accepts single-byte characters (0x00 to 0x7F) and multi-byte sequences (2 to 4 bytes), and requires every continuation byte in a multi-byte sequence to carry the 10xxxxxx prefix (0x80 to 0xBF).
  • Validation of Leading Bytes:
    • The function checks the leading byte to determine how many continuation bytes must follow:
      • 0xC0 prefix (110xxxxx) for 2-byte sequences
      • 0xE0 prefix (1110xxxx) for 3-byte sequences
      • 0xF0 prefix (11110xxx) for 4-byte sequences
    • It explicitly rejects leading bytes that start with the 0xF8 or 0xFC prefix, since these would introduce 5- and 6-byte sequences, which RFC 3629 does not allow.
  • Control Over Overlong Sequences:
    • The function rejects overlong sequences, that is, encodings that use more bytes than the shortest form for a code point (for example, 0xC0 0xAF for "/"). Accepting them can let disallowed characters slip past byte-level filters.
  • Surrogate Handling:
    • It rejects encoded surrogate code points (U+D800 to U+DFFF), which RFC 3629 prohibits in UTF-8.
  • Zero Byte Validation:
    • The function specifically rejects overlong encodings of U+0000 (such as 0xC0 0x80).
  • Comprehensive Error Handling:
    • Every check returns false on failure, so any non-compliant string is reported as invalid.
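
The checks listed above can be sketched compactly by following the byte ranges in the UTF-8 grammar of RFC 3629 (Section 4). The Python below is an illustrative sketch only: it is not the code from this pull request, and the function name `is_rfc3629_utf8` is hypothetical. Restricting the first continuation byte for the leading bytes 0xE0, 0xED, 0xF0 and 0xF4 is what excludes overlong forms, surrogates, and code points above U+10FFFF.

```python
def is_rfc3629_utf8(data: bytes) -> bool:
    """Hypothetical sketch: validate bytes against the RFC 3629 UTF-8 grammar."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:                      # single byte (ASCII)
            i += 1
            continue
        # For each leading byte: number of continuation bytes, plus the
        # allowed range of the FIRST continuation byte.
        if 0xC2 <= lead <= 0xDF:              # 2 bytes (0xC0/0xC1 are always overlong)
            extra, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:                    # 3 bytes, excludes overlong forms
            extra, lo, hi = 2, 0xA0, 0xBF
        elif lead == 0xED:                    # 3 bytes, excludes U+D800..U+DFFF
            extra, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= lead <= 0xEF:            # remaining 3-byte leaders
            extra, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:                    # 4 bytes, excludes overlong forms
            extra, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= lead <= 0xF3:            # 4 bytes
            extra, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:                    # 4 bytes, excludes > U+10FFFF
            extra, lo, hi = 3, 0x80, 0x8F
        else:                                 # stray continuation byte, 0xC0, 0xC1, 0xF5..0xFF
            return False
        if i + extra > n - 1:                 # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + extra + 1): # remaining continuation bytes
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += extra + 1
    return True
```

For example, `is_rfc3629_utf8("€".encode("utf-8"))` is true, while the overlong encoding of U+0000 (`b"\xc0\x80"`) and the encoded surrogate `b"\xed\xa0\x80"` are both rejected.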

## Conclusion

  • The updated seems_utf8() function is intended to comply fully with RFC 3629. Stricter validation improves the integrity and security of code that relies on correct character encoding, and gives developers a more reliable UTF-8 check.

@debarghyabanerjee commented on PR #7463:


8 weeks ago
#3

Hi @desrosj, can you please take a look at this PR? Thanks.
