Make WordPress Core

Opened 9 years ago

Closed 6 years ago

#36456 closed defect (bug) (wontfix)

oEmbeds containing Emojis aren't cached if the meta_value column's charset is utf8 and not utf8mb4

Reported by: birgire's profile birgire Owned by:
Milestone: Priority: normal
Severity: normal Version: 4.4.2
Component: Emoji Keywords:
Focuses: Cc:

Description

oEmbeds (e.g. Instagram, Twitter, ... ) that contains Emojis aren't cached in the wp_postmeta table, if the meta_value column's charset is reported as utf8 and not utf8mb4.

This means that wp_oembed_get() is called for every such embed on each page load.

Here's an example of an Instagram containing an Emoji:

https://www.instagram.com/p/_cj9NXvhrh/

where only the:

_oembed_time_02ba39bcc45811ad672a288b772ca9a2

is written to the wp_postmeta table but not

_oembed_02ba39bcc45811ad672a288b772ca9a2

containing the cached oEmbed content.

I just noticed this for a site with a shared database server with MySQL version < 5.5.3.

In the WP_Embed::shortcode() method we have following:

// Maybe cache the result
if ( $html ) {
	update_post_meta( $post_ID, $cachekey, $html );
	update_post_meta( $post_ID, $cachekey_time, time() );
} elseif ( ! $cache ) {
	update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

This doesn't take care of the situation when update_postmeta() fails to write to the database.

For new meta inserts it uses:

$result = $wpdb->insert( $table, array(
        $column => $object_id,
        'meta_key' => $meta_key,
        'meta_value' => $meta_value
    ) );

This is a wrapper for:

wpdb::_insert_replace_helper()

that contains this part:

$data = $this->process_fields( $table, $data, $format );
if ( false === $data ) {
    return false;
}

And similarly within wpdb::process_fields() we have:

$converted_data = $this->strip_invalid_text( $data );
if ( $data !== $converted_data ) {
    return false;
}

and this can return false for some data conversions.

Example


We can test this with a simple example:

update_post_meta( 123, 'test', 'Smile ????' );

that returns false if:

$wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' ) 	

returns 'utf8' and not 'utf8mb4'.

In this case nothing is written to the wp_postmeta table.

Workaround

Here's what I constructed as a temporary workaround:

/**
 * Encode Emojis in all oembed results, if the charset for the meta_value column is utf8.
 */
add_filter( 'oembed_result', function( $html, $url, $args ) use ( $wpdb )
{	
	/**
	 * For WordPress 4.2+ where the get_col_charset method and wp_encode_emoji function where introduced.
	 * The charset check is based on the https://codex.wordpress.org/Function_Reference/wp_encode_emoji#Example
	 */
	if (   method_exists( $wpdb, 'get_col_charset' ) 
		 && 'utf8' === $wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' ) 
	 )
		$html = wp_encode_emoji( $html );
		
	return $html;
}, 10, 3 );

Suggestions

I see two possibilities to rewrite this part:

// Maybe cache the result
if ( $html ) {
	update_post_meta( $post_ID, $cachekey, $html );
	update_post_meta( $post_ID, $cachekey_time, time() );
} elseif ( ! $cache ) {
	update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

to fix the caching.

Suggestion 1)

// Maybe cache the result
if ( $html ) {

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( $updated_cachekey ) {
        update_post_meta( $post_ID, $cachekey_time, time() );
     } else {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
     }

 } elseif ( ! $cache ) {
     update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
 }

This means that only the link will show up if $updated_cachekey is false and the cache key for the time isn't written to the db as well.

We could also rewrite this to only contain a single update_post_meta() call.

Suggestion 2)

We could also extend 1) to encode the Emojis if needed:

// Maybe cache the result
if ( $html ) {

    // Encode Emojis if the charset of the meta_value column is 'utf8'
    if (    method_exists( $wpdb, 'get_col_charset' ) 
         && 'utf8' === $wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' )
         && function_exists( 'wp_encode_emoji' )
    ) {
        $html = wp_encode_emoji( $html );
    }

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( $updated_cachekey ) {
        update_post_meta( $post_ID, $cachekey_time, time() );
     } else {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
     }

 } elseif ( ! $cache ) {
     update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
 }

where we encode the Emojis when the meta_value column's charset is utf8. The method/function checks would be needed since they are only available in WP 4.2+.

This should give us the cached content instead of only the link.

PS: For the above case the Query Monitor plugin reports 4 HTTP GET requests for a single instagram oEmbed url, that contains an Emoji.

Example:

For https://www.instagram.com/p/_cj9NXvhrh/ they are:

https://api.instagram.com/oembed
?maxwidth=640
&maxheight=960
&url=https%3A%2F%2Fwww.instagram.com%2Fp%2F_cj9NXvhrh%2F
&format=json

https://api.instagram.com/publicapi/oembed/
?maxwidth=640
&maxheight=960
&url=https%3A%2F%2Fwww.instagram.com%2Fp%2F_cj9NXvhrh%2F
&format=json

https://www.instagram.com/publicapi/oembed/
?maxwidth=640
&maxheight=960
&url=https://www.instagram.com/p/_cj9NXvhrh/
&format=json

https://api.instagram.com/oembed/
?maxwidth=640
&maxheight=960
&url=https://www.instagram.com/p/_cj9NXvhrh/
&format=json

I'm not sure why but I only get a single request for a Twitter oEmbed containing Emojis.

PPS: Sorry for the long post.

Change History (9)

#1 @birgire
9 years ago

ps: regarding the two suggestions above, it looks like I forgot that update_post_meta also returns false when trying to update the same existing meta value. So we need to take that into account as well ;-)

Last edited 9 years ago by birgire (previous) (diff)

#2 @birgire
9 years ago

We could additionally look at the $cache value to see if $cachekey already exists.

The two updated suggestions are now:

1)

// Maybe cache the result
if ( $html ) {

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( ! $updated_cachekey && ! $cache ) {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
    }

    update_post_meta( $post_ID, $cachekey_time, time() );

} elseif ( ! $cache ) {
    update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

or 2)

// Maybe cache the result
if ( $html ) {

    global $wpdb;

    // Encode Emojis if the charset of the meta_value column is 'utf8' instead of 'utf8mb4'
    if (    method_exists( $wpdb, 'get_col_charset' ) 
         && 'utf8' === $wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' )
         && function_exists( 'wp_encode_emoji' )
    ) {
        $html = wp_encode_emoji( $html );
    }

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( ! $updated_cachekey && ! $cache ) {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
    }

    update_post_meta( $post_ID, $cachekey_time, time() );

} elseif ( ! $cache ) {
    update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

This ticket was mentioned in Slack in #core by jeremyfelt. View the logs.


9 years ago

#4 @Krstarica
9 years ago

Any update on this?

These days almost everybody is using Emojis, meaning oembeds are not cached at all, making high load on web server constantly fetching embedded content.

Version 0, edited 9 years ago by Krstarica (next)

#5 follow-up: @birgire
9 years ago

@Krstarica, I think this is still problematic for non utf8mb4.

Just imagine a popular site, with an uncached frontpage, displaying full content posts, each containing multiple Instagram oEmbeds with Emojis ;-)

Note that the Instagram oEmbeds can contain other invalid chars than just Emojis, so I ended up using mb_convert_encoding(), like mentioned here: http://stackoverflow.com/a/34640424/2078474, instead of using wp_encode_emoji().

Then JetPack complicated things further, because it's using it's own Instagram handler, if I remember correctly. I therefore had to unload it through the jetpack_shortcodes_to_include filter, to be able to apply the mb_convert_encoding() convertion through the oembed_result filter.

Even if we strip the invalid chars from the fetched Instagram HTML, the javascript part will display it correctly.

Last edited 9 years ago by birgire (previous) (diff)

#6 in reply to: ↑ 5 @Krstarica
9 years ago

@birgire, we've ended up with manually converting Wordpress database, tables and config file to utf8mb4. This should have been done during upgrade process, but for some reason it didn't happen:
https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/

#7 @birgire
9 years ago

@Krstarica yes I would recommend upgrading to utf8mb4 whenever that's supported, but that's not always the case. Good to hear you were able to do that.

#8 @johnbillion
9 years ago

  • Component changed from Embeds to Emoji

#9 @pento
6 years ago

  • Resolution set to wontfix
  • Status changed from new to closed

Given that only a tiny percentage of sites run a version of MySQL that doesn't support utf8mb4, and that we'll likely bump the minimum required version of MySQL at some point in the not-too-distant future, I'm going to close this issue.

The correct fix is to convert the table to utf8mb4.

Note: See TracTickets for help on using tickets.