Make WordPress Core

Opened 9 years ago

Closed 6 years ago

#36456 closed defect (bug) (wontfix)

oEmbeds containing Emojis aren't cached if the meta_value column's charset is utf8 and not utf8mb4

Reported by: birgire's profile birgire Owned by:
Milestone: Priority: normal
Severity: normal Version: 4.4.2
Component: Emoji Keywords:
Focuses: Cc:

Description

oEmbeds (e.g. Instagram, Twitter, ... ) that contains Emojis aren't cached in the wp_postmeta table, if the meta_value column's charset is reported as utf8 and not utf8mb4.

This means that wp_oembed_get() is called for every such embed on each page load.

Here's an example of an Instagram containing an Emoji:

https://www.instagram.com/p/_cj9NXvhrh/

where only the:

_oembed_time_02ba39bcc45811ad672a288b772ca9a2

is written to the wp_postmeta table but not

_oembed_02ba39bcc45811ad672a288b772ca9a2

containing the cached oEmbed content.

I just noticed this for a site with a shared database server with MySQL version < 5.5.3.

In the WP_Embed::shortcode() method we have following:

// Maybe cache the result
if ( $html ) {
	update_post_meta( $post_ID, $cachekey, $html );
	update_post_meta( $post_ID, $cachekey_time, time() );
} elseif ( ! $cache ) {
	update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

This doesn't take care of the situation when update_postmeta() fails to write to the database.

For new meta inserts it uses:

$result = $wpdb->insert( $table, array(
        $column => $object_id,
        'meta_key' => $meta_key,
        'meta_value' => $meta_value
    ) );

This is a wrapper for:

wpdb::_insert_replace_helper()

that contains this part:

$data = $this->process_fields( $table, $data, $format );
if ( false === $data ) {
    return false;
}

And similarly within wpdb::process_fields() we have:

$converted_data = $this->strip_invalid_text( $data );
if ( $data !== $converted_data ) {
    return false;
}

and this can return false for some data conversions.

Example


We can test this with a simple example:

update_post_meta( 123, 'test', 'Smile ????' );

that returns false if:

$wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' ) 	

returns 'utf8' and not 'utf8mb4'.

In this case nothing is written to the wp_postmeta table.

Workaround

Here's what I constructed as a temporary workaround:

/**
 * Encode Emojis in all oembed results, if the charset for the meta_value column is utf8.
 */
add_filter( 'oembed_result', function( $html, $url, $args ) use ( $wpdb )
{	
	/**
	 * For WordPress 4.2+ where the get_col_charset method and wp_encode_emoji function where introduced.
	 * The charset check is based on the https://codex.wordpress.org/Function_Reference/wp_encode_emoji#Example
	 */
	if (   method_exists( $wpdb, 'get_col_charset' ) 
		 && 'utf8' === $wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' ) 
	 )
		$html = wp_encode_emoji( $html );
		
	return $html;
}, 10, 3 );

Suggestions

I see two possibilities to rewrite this part:

// Maybe cache the result
if ( $html ) {
	update_post_meta( $post_ID, $cachekey, $html );
	update_post_meta( $post_ID, $cachekey_time, time() );
} elseif ( ! $cache ) {
	update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

to fix the caching.

Suggestion 1)

// Maybe cache the result
if ( $html ) {

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( $updated_cachekey ) {
        update_post_meta( $post_ID, $cachekey_time, time() );
     } else {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
     }

 } elseif ( ! $cache ) {
     update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
 }

This means that only the link will show up if $updated_cachekey is false and the cache key for the time isn't written to the db as well.

We could also rewrite this to only contain a single update_post_meta() call.

Suggestion 2)

We could also extend 1) to encode the Emojis if needed:

// Maybe cache the result
if ( $html ) {

    // Encode Emojis if the charset of the meta_value column is 'utf8'
    if (    method_exists( $wpdb, 'get_col_charset' ) 
         && 'utf8' === $wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' )
         && function_exists( 'wp_encode_emoji' )
    ) {
        $html = wp_encode_emoji( $html );
    }

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( $updated_cachekey ) {
        update_post_meta( $post_ID, $cachekey_time, time() );
     } else {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
     }

 } elseif ( ! $cache ) {
     update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
 }

where we encode the Emojis when the meta_value column's charset is utf8. The method/function checks would be needed since they are only available in WP 4.2+.

This should give us the cached content instead of only the link.

PS: For the above case the Query Monitor plugin reports 4 HTTP GET requests for a single instagram oEmbed url, that contains an Emoji.

Example:

For https://www.instagram.com/p/_cj9NXvhrh/ they are:

https://api.instagram.com/oembed
?maxwidth=640
&maxheight=960
&url=https%3A%2F%2Fwww.instagram.com%2Fp%2F_cj9NXvhrh%2F
&format=json

https://api.instagram.com/publicapi/oembed/
?maxwidth=640
&maxheight=960
&url=https%3A%2F%2Fwww.instagram.com%2Fp%2F_cj9NXvhrh%2F
&format=json

https://www.instagram.com/publicapi/oembed/
?maxwidth=640
&maxheight=960
&url=https://www.instagram.com/p/_cj9NXvhrh/
&format=json

https://api.instagram.com/oembed/
?maxwidth=640
&maxheight=960
&url=https://www.instagram.com/p/_cj9NXvhrh/
&format=json

I'm not sure why but I only get a single request for a Twitter oEmbed containing Emojis.

PPS: Sorry for the long post.

Change History (9)

#1 @birgire
9 years ago

ps: regarding the two suggestions above, it looks like I forgot that update_post_meta also returns false when trying to update the same existing meta value. So we need to take that into account as well ;-)

Last edited 9 years ago by birgire (previous) (diff)

#2 @birgire
9 years ago

We could additionally look at the $cache value to see if $cachekey already exists.

The two updated suggestions are now:

1)

// Maybe cache the result
if ( $html ) {

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( ! $updated_cachekey && ! $cache ) {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
    }

    update_post_meta( $post_ID, $cachekey_time, time() );

} elseif ( ! $cache ) {
    update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

or 2)

// Maybe cache the result
if ( $html ) {

    global $wpdb;

    // Encode Emojis if the charset of the meta_value column is 'utf8' instead of 'utf8mb4'
    if (    method_exists( $wpdb, 'get_col_charset' ) 
         && 'utf8' === $wpdb->get_col_charset( $wpdb->postmeta, 'meta_value' )
         && function_exists( 'wp_encode_emoji' )
    ) {
        $html = wp_encode_emoji( $html );
    }

    $updated_cachekey = update_post_meta( $post_ID, $cachekey, $html );

    if( ! $updated_cachekey && ! $cache ) {
        update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
    }

    update_post_meta( $post_ID, $cachekey_time, time() );

} elseif ( ! $cache ) {
    update_post_meta( $post_ID, $cachekey, '{{unknown}}' );
}

This ticket was mentioned in Slack in #core by jeremyfelt. View the logs.


9 years ago

#4 @Krstarica
9 years ago

Any update on this?

These days almost everybody is using Emojis, meaning oembeds are not cached at all, making high load on web server constantly fetching embedded content.

@birgire temporary workaround works for some embeds, but doesn't work for e.g. https://www.instagram.com/p/BIaw9JUB-DK/

Last edited 9 years ago by Krstarica (previous) (diff)

#5 follow-up: @birgire
9 years ago

@Krstarica, I think this is still problematic for non utf8mb4.

Just imagine a popular site, with an uncached frontpage, displaying full content posts, each containing multiple Instagram oEmbeds with Emojis ;-)

Note that the Instagram oEmbeds can contain other invalid chars than just Emojis, so I ended up using mb_convert_encoding(), like mentioned here: http://stackoverflow.com/a/34640424/2078474, instead of using wp_encode_emoji().

Then JetPack complicated things further, because it's using it's own Instagram handler, if I remember correctly. I therefore had to unload it through the jetpack_shortcodes_to_include filter, to be able to apply the mb_convert_encoding() convertion through the oembed_result filter.

Even if we strip the invalid chars from the fetched Instagram HTML, the javascript part will display it correctly.

Last edited 9 years ago by birgire (previous) (diff)

#6 in reply to: ↑ 5 @Krstarica
9 years ago

@birgire, we've ended up with manually converting Wordpress database, tables and config file to utf8mb4. This should have been done during upgrade process, but for some reason it didn't happen:
https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/

#7 @birgire
9 years ago

@Krstarica yes I would recommend upgrading to utf8mb4 whenever that's supported, but that's not always the case. Good to hear you were able to do that.

#8 @johnbillion
9 years ago

  • Component changed from Embeds to Emoji

#9 @pento
6 years ago

  • Resolution set to wontfix
  • Status changed from new to closed

Given that only a tiny percentage of sites run a version of MySQL that doesn't support utf8mb4, and that we'll likely bump the minimum required version of MySQL at some point in the not-too-distant future, I'm going to close this issue.

The correct fix is to convert the table to utf8mb4.

Note: See TracTickets for help on using tickets.