WordPress.org

Make WordPress Core

Opened 6 years ago

Closed 3 years ago

#25387 closed defect (bug) (fixed)

Autoembeds don't work with paragraphs

Reported by: Looimaster Owned by: azaozz
Milestone: 4.6 Priority: normal
Severity: normal Version: 3.6
Component: Embeds Keywords: has-patch needs-testing early
Focuses: Cc:

Description

I'm writing the below pieces of code in "Text" mode of the posts editor.

Does work:

http://www.youtube.com/watch?v=xxxxxxxxxxx

Does work:

<p>test
http://www.youtube.com/watch?v=xxxxxxxxxxx
test</p>

Doesn't work:

<p>http://www.youtube.com/watch?v=xxxxxxxxxxx</p>

Attachments (10)

25387.diff (585 bytes) - added by seanchayes 5 years ago.
25387-sample-oembed-content.jpg (135.0 KB) - added by seanchayes 5 years ago.
25387.2.diff (5.1 KB) - added by seanchayes 5 years ago.
25387.3.diff (5.1 KB) - added by wonderboymusic 5 years ago.
25387.4.diff (5.8 KB) - added by seanchayes 4 years ago.
25387.patch (2.2 KB) - added by iseulde 3 years ago.
25387.2.patch (2.2 KB) - added by iseulde 3 years ago.
25387.3.patch (2.3 KB) - added by azaozz 3 years ago.
25387.4.patch (2.6 KB) - added by iseulde 3 years ago.
25387.5.patch (2.7 KB) - added by azaozz 3 years ago.

Download all attachments as: .zip

Change History (51)

#1 @knutsp
6 years ago

Also when the visual editor adds something like:

<span style="some:styling;">http://www.youtube.com/watch?v=xxxxxxxxxxx</span>

it doesn't work.

#2 @redsweater
6 years ago

I understand the current matching is very conservative on purpose, as that was the tradeoff when turning on autoembed by default. But for the examples listed above and many others, I think the current pattern could be loosened up a bit to match many, many cases where the desired autoembedding would in fact be preferable to leaving the URL raw.

For the sake of argument here is one proposed alternative regex to the one currently used in class-wp-embed.php's autoembed() function:

(?:(?:^[^\s"']*\s*>?)|(?:\s))(https?://[^\s<]+)

I am not a regex expert, but i cobbled this together. I also ran it against several test cases where it seems to obtain the desired behavior. In each of these example cases, it matches and extracts the desired URL and nothing surrounding it:

<p>
http://www.wordpress.org/ 
</p>
<p>http://www.wordpress.org/</p>
Some text
http://www.wordpress.org/
Some more text
<p> http://www.wordpress.org/</p>
<p>http://www.apple.com/ </p>
http://www.red-sweater.com/
http://www.wordpress.org/

Note that the pattern does NOT match obvious examples I could think of where a literal URL would be expected to be left alone, for example:

<a href="http://www.wordpress.org">WordPress</a>

On the other hand, there are some areas where a false-ID will still be made. For example if the HTML above were slightly malformed and possessed spaces around the URL:

<a href=" http://www.wordpress.org ">WordPress</a>

In the end I think we should weigh the current frustration over the malfunctioning behavior against the possible frustration of falsely converting URLs. It seems to me that so long as the major cases where the URL is clearly intended to be part of markup are covered, there will be very few frustratingly unwanted substitutions. That said, I get that without an option to disable the feature, one frustrating subsitution could be one too many.

In short it feels to me that the current substitution behavior is not "magical" enough to justify its own existence. Requiring an author to meet all the caveats of the auto-embed regex match as it stands today seems about as onerous as requiring them to put the URL into literal [embed] shortcodes. If WordPress is going to support this laudible feature then I think it should be expanded to affect many more obvious substitution scenarios.

#3 @redsweater
6 years ago

  • Cc jalkut@… added

#4 @nacin
5 years ago

  • Component changed from General to Embeds
  • Milestone changed from Awaiting Review to 3.9

Would be nice to fix this. Thanks for the thorough analysis, redsweater.

#5 @seanchayes
5 years ago

This is my first punt at regex and I have attached a patch and a screenshot with the content I tested against.

In essence I started by working from redsweaters regex, found I kept getting an error when I was testing and I just couldn't deduce why. This is what I got:

preg_replace_callback(): Unknown modifier '(' ...

So, I modified the original regex and it's in the patch and here:

|(?<!")(https?://[^\s"\[<]+)|im

Along the way I did fairly basic testing with YouTube, Vimeo, WordPress.tv and Slideshare urls spread about some content (within tags and anchors as well as on separate lines and within embeds) I crafted to identify where a url was successfully parsed and embedded vs left alone.

Long story short, based on my testing, I could resolve the issue reported in this ticket, but I expect more testing is required to cover more scenarios and perhaps someone with a little more experience (and time) with regex might offer guidance on the pattern used.

@seanchayes
5 years ago

#6 @seanchayes
5 years ago

  • Keywords has-patch needs-testing added

#7 @SergeyBiryukov
5 years ago

  • Keywords needs-unit-tests added

#8 @seanchayes
5 years ago

I've added a revised patch 25387.2.patch with a unit test (my first).

Patch is now more comprehensive looking for urls and assessing the patterns of text/markup around the url. By no means can it be considered fully comprehensive just more comprehensive than we have now.

Based on the original bug report and on the feedback from redsweater and knutsp it should move us forward.

Guidance welcome if the unit test is not up to scratch.

@seanchayes
5 years ago

#9 @nacin
5 years ago

  • Keywords has-unit-tests added; needs-unit-tests removed

seanchayes, at a glance, the patch and unit tests look good. One thing I'll note is we use tabs rather than spaces for indentation. But that's the only thing I noticed.

#10 @nacin
5 years ago

  • Keywords commit 4.0-early added
  • Milestone changed from 3.9 to Future Release

I'm really concerned about the potential for breakage here. I think the tests are great but I'd like to make sure this gets sufficient real-world testing. Let's make this happen in early 4.0.

#11 @wonderboymusic
5 years ago

  • Keywords has-unit-tests commit removed
  • Milestone changed from Future Release to 4.0

25387.3.diff is a refresh. The last assertion fails, and Tests_Post_Output::test_the_content_shortcode() fails when running the whole suite. I didn't fix the tests, they still need a look.

#12 @MikeLittle
5 years ago

Testing this bug on revision r28585. The issue is already fixed on the front end.
However the preview in the edit screen only works for examples 1 and 3.
Applying the patch does not fix this new variation of the problem or break the front end view.

This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture. View the logs.


5 years ago

#14 @DrewAPicture
5 years ago

  • Keywords 4.1-early added; 4.0-early removed
  • Milestone changed from 4.0 to Future Release

I agree with comment:10 about the real potential for breakage. Something like this should really go in at the beginning of a cycle, not the end. Punting for now.

#16 @iseulde
4 years ago

  • Keywords 4.1-early removed

Why does this not run after wpautop? Then we can look at the paragraph tags just like we do for the previews.

#17 @azaozz
4 years ago

Don't forget that pasting a URL and having it auto-embed is a shortcut. The "standard" way of doing this is to open the media modal and paste the URL on the "Insert from URL" tab. Then it is added to the post content wrapped in a [embed] shortcode.

This is the same for the Text editor: if embedding an URL, wrap it in [embed] or insert it with the media modal.

The shortcut that will auto-embed an URL is more of an "advanced user feature" and has some restrictions. The URL has to be pasted in a new paragraph or separated by new lines.

From the tests above, the only user case we should support is when an URL is wrapped in <p>. The rest should really be using the shortcode, it works perfectly.

Last edited 4 years ago by azaozz (previous) (diff)

#18 @knutsp
4 years ago

  • Version changed from 3.6.1 to 3.6

From the tests above, the only user case we should support is when an URL is wrapped in <p>.

Anyone?

#19 @seanchayes
4 years ago

Refreshing the patch with the current revision (33869)

The new patch is 25387.4.diff

And, I went out on a limb and added an extra autoembed_callback specifically for inline media url paragraph based checks and added that extra call into the autoembed function in class-wp-embed.php. This specific test is to identify media urls within <p> and </p>.

I updated the tests but there's still a failure that I can't reconcile :-(

However, the media embeds do work in an improved manner within the Preview Changes view and within Visual vs Text edit views.

Hopefully a step forward.

UPDATE: The reason for the additional and separate callback allowed the tests to pass and the additional functionality offered by this patch to co-exist.

CORRECTION: But I need to review and update the tests and upload the patch again.

Last edited 4 years ago by seanchayes (previous) (diff)

@seanchayes
4 years ago

#20 @seanchayes
4 years ago

What is everyone else seeing using current revision r33902? I'm not saying this revision changed the behavior just that I can't easily replicate the original scenario of this ticket.

I've noticed that after creating/editing in Text switching to Visual knocks out the paragraph tags upon return to Text.

So, in this scenario, when I wrap a YouTube (or DailyMotion) url in <p></p> tags in Text and then switch to Visual and then switch back to Text the url is displayed without the wrapping <p> tag, exists on it's own line and is processed just like a regular embed.

For that matter any content that was also wrapped in paragraph tags also loses the tags in the switch.

So I'm not sure we are adding any value in this patch/ticket and am interested if anyone agrees or there's something else I'm missing perhaps.

#21 @redsweater
4 years ago

@seanchayes I don't think you should focus on whether the switch from Text to Visual and back happens to generate the desired outcome, that's sort of a side case to the main gist of this case IMHO, which is that fundamentally at the core the autoembed mechanism fails to generate the desired embeds with specific content.

Consider that the content for any given blog post may never pass through the WordPress admin panel's "Visual" editor. It can't be assumed. Here is an easy way to reproduce the existing bug with top of trunk WordPress from the web admin screen:

  1. Create a new post.
  2. Switch to Text mode editing.
  3. Type or paste:
    <p>https://www.youtube.com/watch?v=Mb9dL0cb6ho</p>
    
  4. Publish.

Also please consider that simply tacking some behavior on to the Text editor in WordPress's web admin panel is not a sufficient fix, because the content may also have come from an automated source within the blog (e.g. from a plugin, etc), or it may have been submitted through the WordPress XMLRPC API or REST API.

The fix has to be in the core autoembed detection/substitution mechanism to be complete.

#22 @swissspidy
4 years ago

#34349 was marked as a duplicate.

#23 @iseulde
3 years ago

  • Keywords early added
  • Milestone changed from Future Release to 4.6

Let's try to fix this.

This ticket was mentioned in Slack in #core-editor by iseulde. View the logs.


3 years ago

#25 follow-up: @azaozz
3 years ago

This is a good place to use the "tests first" approach:

  1. Decide exactly what the code has to do:
    • Match an URL when it is on a separate line. The current code does this.
    • Match an URL when it is the only content of a <p> tag excluding the actual tag and possibly matching line breaks before/after.
    • Anything else?
  2. Write unit tests for all cases.
  3. Write code that passes all unit tests.

This also looks like a "good first patch" for somebody familiar with regex.

Last edited 3 years ago by azaozz (previous) (diff)

#26 in reply to: ↑ 25 @redsweater
3 years ago

Replying to azaozz:

This is a good place to use the "tests first" approach:

  1. Decide exactly what the code has to do:
    • Match an URL when it is on a separate line. The current code does this.
    • Match an URL when it is the only content of a <p> tag excluding the actual tag and possibly matching line breaks before/after.
    • Anything else?

I had to skim back to find your comment supporting the idea that the bare, non-shortcode URL should only work inside <p> tags. I'm not sure I agree with this reasoning at face value. The idea that it's a "shortcut" that should only work in a handful of very explicit scenarios is at odds with the reasoning that it's "magic," provides user delight, and should work in as many cases as can be confidently confirmed as safe :) I object slightly to the suggestion that there is a right way and a wrong way to insert these links. If the "shortcut" is the wrong way, then maybe it should be removed entirely to avoid ambiguity about the behavior?

I think the shortcut is a good user experience, and to that end I would refer back to my original comments where I outlined some other situations where I think it should work: https://core.trac.wordpress.org/ticket/25387?replyto=25#comment:2.

At the very least I think a URL, standing alone on a line without paragraph tags is unambiguous enough to warrant receiving the treatment.

#27 follow-up: @azaozz
3 years ago

...(this) should work in as many cases as can be confidently confirmed as safe
I think the shortcut is a good user experience, and to that end I would refer back to my original comments where I outlined some other situations where I think it should work

Right, that's the main reason why I think we should set up how this should work exactly, and that it is a good idea to make all tests first, then code it.

Currently (and for the last three years) this "shortcut" requires that the users paste (or type?) an "embeddable" URL on its own line. Because of wpautop, spaces and tabs (but not \n) before and after the URL are stripped and ignored. However this doesn't work when autop is disabled or when the URL is wrapped in (any) tags.

The examples in https://core.trac.wordpress.org/ticket/25387#comment:2 are all cases where this should work (although the first one is a bit odd):

<p>
http://www.wordpress.org/ 
</p>

<p>http://www.wordpress.org/</p>

Some text
http://www.wordpress.org/
Some more text

<p> http://www.wordpress.org/</p>
<p>http://www.apple.com/ </p>
http://www.red-sweater.com/

http://www.wordpress.org/

Thinking we can also explore some even less common cases like the URL is in a <div> or perhaps <li>.

Last edited 3 years ago by azaozz (previous) (diff)

#28 follow-up: @yosefeliezrie
3 years ago

Just a question and maybe i'm missing something but IMO there should be some sort of a filter to stop this functionality, what happens if the user just wants to link to a youtube and not embed or another link and not embed? Is there already a filter for this?

#29 in reply to: ↑ 27 @yosefeliezrie
3 years ago

i agree that we should look at different kinds of html and not just a p tag, like <div>, <li> or <section> tags and the like, the possible list may go on and not sure we can get all examples maybe just a regex for youtube within a <></> may be hard to do with regex but i'm not such a master so maybe it is possible.

Replying to azaozz:

Thinking we can also explore some even less common cases like the URL is in a <div> or perhaps <li>.

Last edited 3 years ago by yosefeliezrie (previous) (diff)

#30 in reply to: ↑ 28 ; follow-up: @Viper007Bond
3 years ago

Replying to yosefeliezrie:

Just a question and maybe i'm missing something but IMO there should be some sort of a filter to stop this functionality, what happens if the user just wants to link to a youtube and not embed or another link and not embed? Is there already a filter for this?

If they want to make a link, then they should insert a link, e.g. <a href="YouTube URL">YouTube URL</a>. Those won't be embedded, only plain text URLs will be.

Alternatively any URL that isn't all alone on its own line won't be embeded.

#31 in reply to: ↑ 30 @redsweater
3 years ago

Replying to Viper007Bond:

If they want to make a link, then they should insert a link, e.g. <a href="YouTube URL">YouTube URL</a>. Those won't be embedded, only plain text URLs will be.

Good point. I assumed the gist of the @yosefeliezrie's argument was that pasting plain links that are not embeddable does autolink them. If it were true then "stealing" that behavior for a special class of URLs that get unilaterally embedded could be a problem. But in my test just now it doesn't like for example pasting "http://trac.wordpress.org" as a standalone URL into a document, even on its own line, doesn't lead to it to be autolinked or receiving any special treatment at all.

This ticket was mentioned in Slack in #core by ocean90. View the logs.


3 years ago

This ticket was mentioned in Slack in #core by ocean90. View the logs.


3 years ago

@iseulde
3 years ago

#34 @iseulde
3 years ago

Added a patch with a possible fix and unit tests. Note that the last test would fail even without the patch.

This ticket was mentioned in Slack in #core-editor by iseulde. View the logs.


3 years ago

@iseulde
3 years ago

#36 @iseulde
3 years ago

Added case

'<p>https://w.org
</p>'

#37 @iseulde
3 years ago

A similar issue is that auto embed won't work if content is saved straight to the database as HTML. E.g. if an external editor saves HTML through the REST API. This is the only thing that breaks in WordPress if content is saved as HTML.

Last edited 3 years ago by iseulde (previous) (diff)

@azaozz
3 years ago

#38 @azaozz
3 years ago

25387.2.patch looks good. Wondering if it is worth it doing some light optimizations like 25387.3.patch? Or even something like:

if ( ! preg_match( '@(?<!")https?://@i', $content ) ) {
    return $content;
}

(return early if there are no URLs that are not preceded by a double quote).

Last edited 3 years ago by azaozz (previous) (diff)

@iseulde
3 years ago

#39 @iseulde
3 years ago

Or since both always require either whitespace or > we could use (^|\s|>)https?:// if we'd do a quick check for plain URLs.

@azaozz
3 years ago

#40 @azaozz
3 years ago

Yep, 25387.4.patch looks even better. In 25387.5.patch:

#41 @azaozz
3 years ago

  • Owner set to azaozz
  • Resolution set to fixed
  • Status changed from new to closed

In 37627:

Auto-embedding:

  • We already match URLs on their own line, add another regex to match URLs in their own paragraphs.
  • Always exclude the \s<>" characters when matching.
  • Add more unit tests.

Props iseulde, azaozz.
Fixes #25387.

Note: See TracTickets for help on using tickets.