Opened 8 years ago
Last modified 8 weeks ago
#38044 new defect (bug)
Make seems_utf8() RFC 3629 compliant.
Reported by: | gitlost | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | 1.2.1 |
Component: | Formatting | Keywords: | has-patch |
Focuses: | | Cc: | |
Description
seems_utf8() should be made RFC 3629 compliant. Currently it accepts overlong sequences and surrogates, which will cause PHP functions expecting valid UTF-8 strings to fail.
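To make the two problem cases concrete, here is a small illustrative sketch. It is written in Python rather than PHP and does not call seems_utf8(); the helper name decodes_as_utf8 and the sample byte strings are assumptions chosen for illustration. It shows how a strict UTF-8 decoder treats an overlong sequence and an encoded surrogate.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
        return True
    except UnicodeDecodeError:
        return False


# Sample byte strings (illustrative, not taken from the ticket).
samples = {
    "overlong encoding of '/' (U+002F)": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "euro sign U+20AC": b"\xe2\x82\xac",
}

for label, raw in samples.items():
    print(f"{label}: {decodes_as_utf8(raw)}")
```

Under Python 3's strict decoder the first two samples are rejected and the third is accepted. A lenient check that lets the first two through hands strings to later code that a strict consumer will refuse.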
Change History (3)
#1
@
6 years ago
- Keywords needs-patch added
- Milestone changed from Awaiting Review to Future Release
This ticket was mentioned in PR #7463 on WordPress/wordpress-develop by @debarghyabanerjee.
2 months ago
#2
- Keywords has-patch added; needs-patch removed
Trac Ticket: Core-38044
## Overview
- This pull request updates the seems_utf8 function so that it validates UTF-8 encoded strings according to RFC 3629. Only well-formed UTF-8 sequences are accepted, so malformed input is rejected before it reaches code that assumes valid UTF-8.
## Key Features:
- UTF-8 Encoding Compliance:
  - The function adheres strictly to the UTF-8 encoding rules defined in RFC 3629, which allow a maximum of 4 bytes per character.
- Handling of Single and Multi-byte Sequences:
  - It correctly identifies and processes single-byte (0x00 to 0x7F) and multi-byte sequences (2 to 4 bytes), ensuring that every continuation byte in a multi-byte sequence carries the 10xxxxxx prefix (0x80 to 0xBF).
- Validation of Leading Bytes:
  - The function checks the high bits of the leading byte to determine the number of continuation bytes required:
    - 0xC0 (110xxxxx) for 2-byte sequences
    - 0xE0 (1110xxxx) for 3-byte sequences
    - 0xF0 (11110xxx) for 4-byte sequences
  - It explicitly rejects any leading bytes starting with 0xF8 or 0xFC, as these introduce the 5- and 6-byte forms that RFC 3629 no longer permits.
- Control Over Overlong Sequences:
  - The function rejects overlong sequences, which use more bytes than necessary to represent a character. Accepting them would let the same character pass under several byte spellings, a known way to slip characters past input filters.
- Surrogate Handling:
  - It rejects encoded surrogate code points (U+D800 to U+DFFF), which RFC 3629 prohibits in UTF-8.
- Zero Byte Validation:
  - The function specifically rejects overlong encodings of U+0000 (for example 0xC0 0x80).
- Comprehensive Error Handling:
  - Each check returns false for invalid cases, so any non-compliant string is rejected.
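The checks listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the PHP code from PR #7463. The function name seems_utf8_strict is hypothetical, and the approach of constraining the first continuation byte per leading byte follows the byte ranges in the RFC 3629 syntax, which may differ from how the patch structures its checks.

```python
def seems_utf8_strict(data: bytes) -> bool:
    """Hypothetical sketch: True if data is well-formed UTF-8 per RFC 3629."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            # Single-byte (ASCII) character.
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            # 2-byte sequence; 0xC0 and 0xC1 are excluded as always overlong.
            need, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= lead <= 0xEF:
            need = 2
            lo = 0xA0 if lead == 0xE0 else 0x80  # E0 80..9F would be overlong
            hi = 0x9F if lead == 0xED else 0xBF  # ED A0..BF would be surrogates
        elif 0xF0 <= lead <= 0xF4:
            need = 3
            lo = 0x90 if lead == 0xF0 else 0x80  # F0 80..8F would be overlong
            hi = 0x8F if lead == 0xF4 else 0xBF  # F4 90..BF exceeds U+10FFFF
        else:
            # Stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF
            # (which covers the old 5- and 6-byte lead bytes 0xF8 and 0xFC).
            return False
        if i + need >= n:
            return False  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False  # first continuation byte outside the allowed range
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False  # remaining bytes must be plain continuation bytes
        i += need + 1
    return True


if __name__ == "__main__":
    print(seems_utf8_strict("héllo €".encode("utf-8")))  # well-formed text
    print(seems_utf8_strict(b"\xc0\x80"))                # overlong U+0000
    print(seems_utf8_strict(b"\xed\xa0\x80"))            # encoded surrogate
```

Restricting only the first continuation byte is enough to exclude overlong forms, surrogates, and code points above U+10FFFF, because those cases are all decided by the top bits of the code point.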
## Conclusion
- The updated seems_utf8 function is intended to comply fully with RFC 3629. By rejecting malformed UTF-8, it protects code that relies on valid character encoding. This pull request aims to integrate that behavior so developers have a reliable UTF-8 validation function.
@debarghyabanerjee commented on PR #7463:
8 weeks ago
#3
Hi @desrosj, can you please take a look at this PR? Thanks.
Related: #29717.