WordPress.org

Make WordPress Core

Opened 2 months ago

Last modified 5 weeks ago

#43809 assigned enhancement

Add personal data from posts to personal data export

Reported by: TZ Media Owned by: tz-media
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Privacy Keywords: gdpr has-patch needs-testing has-unit-tests
Focuses: Cc:

Description

Add personal data from posts to the personal data export infrastructure built in #43438 and #43546.

Attachments (6)

43809.patch (4.4 KB) - added by TZ Media 2 months ago.
Posts exporter - does not handle attachments yet
43809.diff (3.2 KB) - added by desrosj 8 weeks ago.
43809.2.diff (5.9 KB) - added by desrosj 8 weeks ago.
Add unit tests.
43809.3.diff (5.8 KB) - added by desrosj 7 weeks ago.
43809.4.diff (7.1 KB) - added by desrosj 7 weeks ago.
43809.5.diff (8.3 KB) - added by desrosj 7 weeks ago.

Download all attachments as: .zip

Change History (39)

@TZ Media
2 months ago

Posts exporter - does not handle attachments yet

#1 @TZ Media
2 months ago

  • Keywords has-patch 2nd-opinion added

How should we handle attachments / files stored in the media library attached to a post?

  • Should media files that are attached to a post be exported together with the post?
  • What to do with media files used in a post, but originally uploaded by someone else?
  • What to do with media files uploaded by a user, but not used in one of his posts?

Due to these cases, I believe that attachments and posts should be handled separately, but then we lose the connection to the posts they belong to, or have to connect them in the actual output report.

This still leads to some questions about the actual export:

  • How to handle <img>-tags (and probably <a>-tags for other file formats) in post content? Should these be rewritten in some way if the actual file is included in the export?
  • If yes, would that happen in the exporter, or in the code that generates the report? Same for the actual media files.

This ticket was mentioned in Slack in #gdpr-compliance by tz-media. View the logs.


2 months ago

#3 @desrosj
2 months ago

  • Milestone changed from Awaiting Review to 4.9.6
  • Owner set to tz-media
  • Status changed from new to assigned

#4 @desrosj
2 months ago

  • Keywords reporter-feedback needs-unit-tests added

@TZ Media good first pass here. Some questions and feedback. These may have been discussed already or not apply. I did my best to look for answers to these questions but couldn't find anything.

  • Using strtolower() on the email will not return results for users with any uppercase letters in their email.
  • If a user is not found by get_user_by(), should an error be returned?
  • get_post_stati() is called twice. Can this be consolidated?
  • Can't posts with the auto-save status be considered personal data? If revision posts are included, maybe auto-saves should too?
  • The get_posts() call has posts and pages hardcoded. What if there are custom post types that plugins and themes want to be included in the export? Should public and private post types be included by default? Should this be passed to register_post_type() like show_in_rest, for example?
  • Was there a reason that get_posts() was chosen instead of a WP_Query? get_posts() will surpress filters by default. Should plugins be able to filter the WHERE and JOIN clauses?
  • The user data ticket also mentions user meta, but there are no mentions of post meta here. Should that be included here, or addressed in a separate ticket?
  • $post_data_to_export[] = array( 'name' => $name, 'value' => $value );: these should each be on their own lines.
  • $done is evaluated based on the current number of posts retrieved with no context of where the process is. If a WP_Query is used instead, it would know the total number of results, and one iteration of the function could be prevented when there are the same number of posts as the $number.
  • Was there a reason that 500 was chosen for number? That could be high, especially if post meta is added and posts have lots of meta data. Should this number be filterable? On servers with more resources, a site owner may want to process more at once.
  • Functions need inline documentation.
  • Needs unit tests.

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 weeks ago

#6 @allendav
8 weeks ago

A lot of this doesn't look like personal data - i.e. data that could be used to identify a person. I strongly recommend we reduce this list to just a link to the person's post and that be it.

This ticket was mentioned in Slack in #core-media by desrosj. View the logs.


8 weeks ago

@desrosj
8 weeks ago

#8 @desrosj
8 weeks ago

In 43809.diff:

  • Strip data down to just post ID, URL, and Title.
  • Change get_posts() to a WP_Query.
  • Replaced space indentation with tabs, and other alignment adjustments to follow the coding standards.
  • Added inline documentation for the added functions.
  • Removed the strtolower() call for emails (it seems that the search is case-insensitive, but this is not necessary anyway).

@desrosj
8 weeks ago

Add unit tests.

#9 @desrosj
8 weeks ago

43809.2.diff adds unit tests and fixes the following bugs discovered while writing tests:

  • When a user is not found for the email passed, the function needs to return early to prevent a PHP warning.
  • The max_num_pages check needs to use a <= sign, not a <. When the exporter is on the last page, they will be =, and when there are no results it will be <.

This ticket was mentioned in Slack in #gdpr-compliance by desrosj. View the logs.


8 weeks ago

#11 @TZ Media
8 weeks ago

  • Keywords reporter-feedback removed

Thanks @desrosj for the great work here!

Regarding the questions (as long as they are not already handled in the latest patch):

  • auto_save: Yes, they might be considered personal data.
  • CPTs: When a plugin or theme adds custom post types, I would reason that this theme or plugin is responsible for adding them to the export. Because core can hardly guess what the export for that CPT should contain, and what now. Maybe a CPT is a simple form that does not contain any personal data, maybe it's a show order that is very complicated to export, because one would have to aggregate a lot of data. So I would not include post types that are not defined in core into the export.
    • What other core post types should we include in the export by default?
  • I just used get_posts() for simplicity. There was no specific reason to do so.
  • post_meta: I believed that WP_Post includes all personal information from post_meta that is defined in core. Again, when a plugin or theme adds additional post_meta keys, I expect the plugin or theme to export them in the format they believe is best to export the values.
    • Is there any post_meta added in core that we should add to the export?
    • As I understand it, the actual export loop allows for plugins to add values to the post already exported by the core posts exporter. Because this allows plugins to add their stuff (or core stuff not exported, like post body etc.) to the export.
  • page size 500: I just copied that from the comments exporter. But as long as no complex post_meta is added here, the number should work.
  • Code style: Sorry for that. My IDE is set up to PSR-1/2 styles and our company code style by default. Will change that for core contributing.

Thanks again @desrosj for the great work.

This ticket was mentioned in Slack in #gdpr-compliance by desrosj. View the logs.


7 weeks ago

#13 @allendav
7 weeks ago

CPTs: When a plugin or theme adds custom post types, I would reason that this theme or plugin is responsible for adding them to the export.

Again, when a plugin or theme adds additional post_meta keys, I expect the plugin or theme to export them in the format they believe is best to export the values.

Agreed 100%

Last edited 7 weeks ago by allendav (previous) (diff)

#14 @azaozz
7 weeks ago

  • Keywords close added

Looking more at this: don't think there is any personal data in posts... We are trying to export parts of the post instead of any user data as there is no user data in there :)

Remember, according to the GDPR we have to give *useful* data to the user, and they should be able to reuse that data.

#15 follow-ups: @mnelson4
7 weeks ago

  • Keywords close removed

CPTs: When a plugin or theme adds custom post types, I would reason that this theme or plugin is responsible for adding them to the export. Because core can hardly guess what the export for that CPT should contain, and what now.

Ya, that's how we prefer it, at least.

I took this for a whirl and it worked (although the patch is a bit out of date, as it didn't apply clearly).

Looking more at this: don't think there is any personal data in posts... We are trying to export parts of the post instead of any user data as there is no user data in there :)

Right, there usually isn't any user data in posts. But the text of the GDPR says "‘personal data’ means any information relating to an identified or identifiable natural person" (https://gdpr-info.eu/art-4-gdpr/) That sounds like its a bit more extensive than JUST personally-identifiable data.

If this isn't closed, it seems to me it should include the post's body, not just the URL, but probably not the post ID as that doesn't seem to relate to the user.

#16 @allendav
7 weeks ago

  • Keywords needs-refresh added

#17 in reply to: ↑ 15 @allendav
7 weeks ago

Replying to mnelson4:

If this isn't closed, it seems to me it should include the post's body, not just the URL, but probably not the post ID as that doesn't seem to relate to the user.

If we include the post body (i.e. for context), let's just include a get_the_excerpt excerpt and striptags - otherwise the export report will get super long/noisy and ugly.

It seems reasonable to me that, after reviewing the export, the user may also the site owner for a more complete export of their content in a classic, portalbe WordPress export format using wp-admin/export.php

Last edited 7 weeks ago by allendav (previous) (diff)

@desrosj
7 weeks ago

#18 @desrosj
7 weeks ago

  • Keywords needs-unit-tests needs-refresh removed

Refreshed patch added. Also in 43809.3.diff:

  • Added @since tags to the functions.
  • Added @group tag to the unit tests being introduced.
  • Removed post ID from the export.

This ticket was mentioned in Slack in #gdpr-compliance by desrosj. View the logs.


7 weeks ago

#20 @allendav
7 weeks ago

@desrosj - tests well, but as written, this would pick up non-public posts attributed to the author as well. Probably should NOT include a URL for those (the URL won't work).

Instead of the URL, should probably include the post ID for those, at least for v1.

@desrosj
7 weeks ago

#21 @desrosj
7 weeks ago

  • Keywords needs-testing added; 2nd-opinion removed

In 43809.4.diff:

  • Removed unnecessary 10 priority on filter hook.
  • Remove unused $registered_post_types variable in exporter function.
  • For private posts, show the post ID and post title with 'Private:' prepended.
  • Removed unused $expected variable in tests.
  • Add tests for private posts in the data export.
  • Small alignment fixes.

#22 @desrosj
7 weeks ago

  • Keywords has-unit-tests added

This ticket was mentioned in Slack in #gdpr-compliance by xkon. View the logs.


7 weeks ago

@desrosj
7 weeks ago

#24 @desrosj
7 weeks ago

43809.5.diff expands on 43809.4.diff by:

  • Properly ignore auto-draft posts (not auto-save).
  • Treat password protected posts like private posts (prepend Password Protechted:, only show post title and post ID).
  • Variable properties should be wrapped in curly braces (see https://core.trac.wordpress.org/ticket/43440#comment:20).

#25 in reply to: ↑ 15 @azaozz
7 weeks ago

  • Milestone changed from 4.9.6 to Future Release

Replying to mnelson4:

I think we already discussed this a month ago and decided that posts are NOT personal data that needs to be exported.

Looking more at this: don't think there is any personal data in posts... We are trying to export parts of the post instead of any user data as there is no user data in there :)

Right, there usually isn't any user data in posts. But the text of the GDPR says "‘personal data’ means any information relating to an identified or identifiable natural person" (https://gdpr-info.eu/art-4-gdpr/) That sounds like its a bit more extensive than JUST personally-identifiable data.

Right, but that doesn't mean we should export post content. If we follow that logic, attachments to the post are also associated with the post author, right? But someone else may have uploaded an image and the post author just used it. So why are we giving another person's data to the post author? What about comments? They are also associated with the post, do we include them too?

Another relatively common case is that posts are authored by more than one person. Again, do we want to give someone else's "personal data" to the post author? What about copyright laws, do we have to break them? :)

Also: which exact part of the GDPR is this complying with?

  • The right to see your data? Nope, any registered user can see all of their posts at any time in (at least) couple of ways: as an archive on the front-end, and in a list-table in the admin. Please keep in mind that registered users are "controllers" of their own data.
  • The right to reuse your data? Nope, that export cannot be reused as the post content is not complete. It misses any changes that may have happened on different actions and filters. The user will be able to get much better content if copying from an author archive view on the front-end.

IMHO the right thing to do here is punt this for now and eventually come back to it when it is proven that we MUST introduce yet another (forth?) method of exporting posts.

This ticket was mentioned in Slack in #gdpr-compliance by iandunn. View the logs.


7 weeks ago

#27 @iandunn
7 weeks ago

I tend to agree with Ozz on this, the vast majority of posts will not contain personal data, and are already publicly available, and editable by the author. I think this is something that can probably be handled manually on a case-by-case basis.

I think this could likely be a situation where the GDPR only requires us to be transparent in the privacy policy about the fact that posts are not included in export/anonymize tools because of practical limitations.

Or at the very least, we need some more discussion to interpret the GDPR's requirements in this context.

#28 @mnelson4
7 weeks ago

Yeah I'm on board with punting it if it's still debatted whether this is needed. No sense adding it if it isn't.

I thought this was an effort to provide data portability (I get the impression this type of data should be available to users from articles like https://www.i-scoop.eu/gdpr/right-to-data-portability/, especially the section on "What data portability means"). Anyways, I'm sure if this is important more voices will chime in to that extent.

And yes, the fact there are multiple authors does muddy the waters as to WHOSE personal data it is.

#29 follow-ups: @allendav
7 weeks ago

@mnelson4 - data portability wasn't the intention for this feature - this was solely looking at Right of Access (Article 15).

For data portability (Article 20) right now I think we are relying on a combination of this right of access personal data export alongside the core WordPress export. This (data portability) is definitely a place I think we can iterate on following 4.9.6

I'm not even sure how to bound the data portability problem with respect to plugins yet :P Any ideas? Would a shopping cart plugin be expected to output order data in a format that other e-commerce platforms could use?

#30 in reply to: ↑ 29 @mnelson4
7 weeks ago

Replying to allendav:

@mnelson4 - data portability wasn't the intention for this feature - this was solely looking at Right of Access (Article 15).

For data portability (Article 20) right now I think we are relying on a combination of this right of access personal data export alongside the core WordPress export. This (data portability) is definitely a place I think we can iterate on following 4.9.6

I'm not even sure how to bound the data portability problem with respect to plugins yet :P Any ideas? Would a shopping cart plugin be expected to output order data in a format that other e-commerce platforms could use?

Basically, yes I think for a shopping cart plugin that would probably include order data. As I understand it, it's basically all information the site has about that person and what they've done.

The link I posted above gave some examples:

...a data subject could, as the WP29 guidelines put it, be interested in retrieving a current playlist or history of listened tracks so they can, for example check how often they’ve listened to songs, find a few albums they might want to buy or listen to when using another platform.

... Other examples include digital data regarding the books a data subject has bought via an online bookstore or general e-commerce platform, data regarding their contacts for which they use a certain application (and thus company and data controller) in order to prepare an invitation to an event with another platform, data regarding their energy usage (with the energy company as the data controller) to check out their carbon footprint, data regarding their location history from their service provider, their online search history, you can imagine ample more examples.

#31 in reply to: ↑ 29 @TZ Media
7 weeks ago

Replying to allendav:

@mnelson4 - data portability wasn't the intention for this feature - this was solely looking at Right of Access (Article 15).

I must have missed that. But that explains a lot of misunderstandings between those asking "Is this personal data? Do we need to export this?" vs. those - like me - saying "If it can be associated with a user ID or email address, it must be exported". And a lot of discussions that seemed pointless to me now make sense.

I'm not even sure how to bound the data portability problem with respect to plugins yet :P Any ideas?

We already build that during the last few weeks. We have a framework that plugins can hook into to add any data they want to the export.

Would a shopping cart plugin be expected to output order data in a format that other e-commerce platforms could use?

I'm not aware of widely accepted standard format for order data, so: No, it shouldn't, at least for now. And if there was such a format, that's also no problem, as explained below. The plugin could already do that easily.

We have build a framework that by all measures is fully capable of fulfilling the requirements for data portability, last but not least by implementing the ideas to make the report generation hookable to generate custom formats (that might be standard formats for certain types of data that are required - or just more convenient - for data portability), and add these additional files to the data export .zip by another hook. So I don't understand why we are limiting this amazing feature to only a small portion of its potential.

#32 @azaozz
7 weeks ago

Yeah,

Data subjects have a right to receive personal data which concern them and which they have provided to a controller organization...

also:

It is clear that the right to data portability does not cover all personal data. However, the data may still fall under the right to access.

So basically IP addresses, domain names, browser UA, IDs, hashes, and any other generated or automatically assigned data that is not reusable does not fall under the data portability rules.

Also, the "data portability export" has to be a file in a common machine readable format like CSV, JSON, XML, etc. As mentioned above we already have that, and more, in the WP exporter.

#33 @desrosj
5 weeks ago

  • Component changed from General to Privacy

Moving to the new Privacy component.

Note: See TracTickets for help on using tickets.