Make WordPress Core

Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#43545 closed enhancement (fixed)

Helper functions: Anonymizing data in a standardized way

Reported by: dejliglama's profile dejliglama Owned by: azaozz's profile azaozz
Milestone: 4.9.6 Priority: normal
Severity: normal Version: 5.1
Component: Privacy Keywords: needs-patch gdpr
Focuses: Cc:

Description

To help GDPR Compliance with the anonymization functionality, we need to consider

  • returning anonymized data based on the type of field holding the data eg. e-mail, address, number sequences and other
  • returning data that can't be misunderstood as real data (making sense locally as well as to the administrator of a site


  • ?

A proposal has been to consider a general format with a prefix, an MD5 hash of the data, and a postfix to make the data compliant with the field type.
eg:
johndoe@… > [deleted]-[MD5 of Johndoe@…][postfix]
The e-mail would then become: deleted-0c47598f5738e73c843552572a371543@…

The prefix and postfix of each field type should be localized.

Furthermore, field data that doesn't require MD5 hashing like standard addresses: 123 Main Street - could be localized.

Attachments (9)

43545.1.diff (2.8 KB) - added by jesperher 6 years ago.
anonymize helper function 1
43545.2.diff (1.4 KB) - added by azaozz 6 years ago.
43545.diff (5.8 KB) - added by allendav 6 years ago.
Adds IPV4 and IPV6 anonymization functions and unit tests for everything
43545.3.diff (5.8 KB) - added by azaozz 6 years ago.
43545.4.diff (8.8 KB) - added by azaozz 6 years ago.
43545.5.diff (17.0 KB) - added by iandunn 6 years ago.
Move $ip_prefix and tests, minor cleanup. See comment:30.
43545.6.diff (11.0 KB) - added by birgire 6 years ago.
43545.7.diff (18.4 KB) - added by birgire 6 years ago.
43545.8.diff (1.5 KB) - added by joemcgill 6 years ago.

Download all attachments as: .zip

Change History (57)

#1 @dejliglama
6 years ago

  • Keywords needs-patch gdpr added

#2 @allendav
6 years ago

Copied from #43550 (accidental duplicate)

For example, come up with functions to provide replacement text for things like name and address fields that can be used by plugins when removing personal data. The ideal text will be recognizable by the administration as having been thus removed, e.g. "Anonymized User".

Since email addresses are at the core of our export and removal strategy, the anonymized email address function should also probably support a random hash or some such, e.g. anonymized-user-123456abcdef@… to avoid creating a situation where all anonymized content ends up pointing at a single fictitious user.

From the ticket scrub discussion in Making WordPress gdpr-compliance chat today.

#3 @allendav
6 years ago

#43550 was marked as a duplicate.

#4 follow-up: @jesperher
6 years ago

We could do something like this:

<?php
function wp_privacy_anonymize_data( $data_to_anonymize = '', $type = 'text' ) {
        switch ( $type ) {
                case 'email':
                    $data_to_anonymize = ...
                    break
                ...
                ...
       }
        return $data_to_anonymize;
}

That would mean we could expand it with multiple types, and also allow filtering on the data to be returned.
fx:

<?php
/**
 * Anonymizes data for the GDPR "Right to Anonymization".
 *
* @since 5.0.0
 *
 * @param string $data_to_anonymize.
 * @param string $type the type of the data to be anonymized.
 *
 * @return string $data_to_anonymize as anonymized.
 */
function wp_privacy_anonymize_data( $data_to_anonymize = '', $type = 'text' ) {
        if( ! $data_to_anonymize) {
                return false;
        }

        $type = trim( strtolower( $type ) );

        $orginal_data_to_anonymize = $data_to_anonymize;

        switch ( $type ) {
                case 'id':
                        // should we ever make something to do this?
                        break;
                case 'email':
                        /* translators: IP to anonymize */
                        $data_to_anonymize = __('anonymized-email') . '-' . wp_hash( $data_to_anonymize ) .'@example.com';
                        break;
                case 'url':
                        /* translators: url to anonymize */
                        $data_to_anonymize = __('example.com');
                        break;
                case 'ip':
                        /* translators: IP to anonymize */
                        $data_to_anonymize = __('0.0.0.0');
                        break;
                case 'agent':

                        break;
                case 'float':

                        break;
                case 'number':
                case 'int':
 
                        break;
                case 'date':
                        //should we ever make something to do this? - can dates be personal?
                        break;
                case 'name':

                break;
                case 'text':
                        /* translators: text to anonymize */
                        $data_to_anonymize = __('anonymized-text');
                        break;
                case 'shorttext':
                        /* translators: shorttext to anonymize */
                        $data_to_anonymize = __('anonymized-text. This text has been deleted on request of the person who wrote it.');
                        break;
                case 'longtext':
                        /* translators: longtext to anonymize */
                        $data_to_anonymize = __('anonymized-text. This text has been deleted on request of the person who wrote it. This text has been deleted on request of the person who wrote it. This text has been deleted on request of the person who wrote it. This text has been deleted on request of the person who wrote it. This text has been deleted on request of the person who wrote it. This text has been deleted on request of the person who wrote it.');
                        break;
                default:
                        //keep this empty to allow filtering for other types
                        break;
        }

        // allow for others to filter data_to_anonymize
        $data_to_anonymize = apply_filters( 'wp_privacy_anonymize_data', $data_to_anonymize, $type, $orginal_data_to_anonymize );

        // allow for others to filter data_to_anonymize specific for type
        $data_to_anonymize = apply_filters( "wp_privacy_anonymize_data_{$type}", $data_to_anonymize, $type, $orginal_data_to_anonymize );

        return $data_to_anonymize;
}

Then we could create small helper functions fx:

<?php
function wp_privacy_anonymize_ip( $ip_to_anonymize ) {
        return wp_privacy_anonymize_data( $ip_to_anonymize, 'ip' );
}
Last edited 6 years ago by jesperher (previous) (diff)

#5 @jesperher
6 years ago

What file will these functions fit into, Do you think?

/includes/functions.php


Or is there another file that is more appropriate ?

Last edited 6 years ago by jesperher (previous) (diff)

@jesperher
6 years ago

anonymize helper function 1

#6 follow-up: @dejliglama
6 years ago

In relation to e-mail anoymization (and possibly other fields), the requirement of an e-mail to be unique should be thought into this solution so that e-mails stay unique even after anonymization.

#7 in reply to: ↑ 6 @jesperher
6 years ago

Replying to dejliglama:

In relation to e-mail anoymization (and possibly other fields), the requirement of an e-mail to be unique should be thought into this solution so that e-mails stay unique even after anonymization.

Emails, is already unique, since we already use it as an ID.

The situation where an two hashed emails gets hashed to the same, is very unlikely.
And if the there still needs to be a relation between seperate data, based on email as link, it wouldn't be good to make 2 different emails, based on the same email.

But maybe we could make a "unique" parameter to the function, and if its true, we include a unique ID

#8 @birgire
6 years ago

Thanks for the patch @jesperher

If we use a wrapper like this one:

function wp_privacy_anonymize_ip( $ip_to_anonymize ) {
        return wp_privacy_anonymize_data( $ip_to_anonymize, 'ip' );
}

then we could use:

case 'ip':
	$data_to_anonymize = _wp_privacy_anonymize_ip_handler( $data_to_anonymize );
	break;

in wp_privacy_anonymize_data(), where we could use the IP anonymizing part from the WP_Community_Events::get_unsafe_client_ip() method in (src):

/**
 * Partially anonymize the IP address by reducing it to the corresponding network ID.
 *
 * Modified from https://github.com/geertw/php-ip-anonymizer, MIT license.
 *
 * @since 5.0.0
 * @access private

 * @param  string       $ip IP address.
 * @return false|string $ip The anonymized IP address on success; the given
 *                          address or false on failure.
 */
function _wp_privacy_anonymize_ip( $ip ) {
	$ip_prefix = '';
	$netmask   = false;

	// Detect what kind of IP address this is.
	$is_ipv6 = substr_count( $ip, ':' ) > 1;
	$is_ipv4 = ( 3 === substr_count( $ip, '.' ) );

	if ( $is_ipv6 && $is_ipv4 ) {
		// IPv6 compatibility mode, temporarily strip the IPv6 part, and treat it like IPv4.
		$ip_prefix = '::ffff:';
		$ip = preg_replace( '/^\[?[0-9a-f:]*:/i', '', $ip );
		$ip = str_replace( ']', '', $ip );
		$is_ipv6   = false;
	}

	if ( $is_ipv6 ) {
		// IPv6 addresses will always be enclosed in [] if there's a port.
		$ip_start = 1;
		$ip_end   = (int) strpos( $ip, ']' ) - 1;
		$netmask  = 'ffff:ffff:ffff:ffff:0000:0000:0000:0000';
		
		// Strip the port (and [] from IPv6 addresses), if they exist.
		if ( $ip_end > 0 ) {
			$ip = substr( $ip, $ip_start, $ip_end );
		}

		// Partially anonymize the IP by reducing it to the corresponding network ID.
		if ( function_exists( 'inet_pton' ) && function_exists( 'inet_ntop' ) ) {
			$ip = inet_ntop( inet_pton( $ip ) & inet_pton( $netmask ) );
		}
	} elseif ( $is_ipv4 ) {
		// Strip any port and partially anonymize the IP.
		$last_octet_position = strrpos( $ip, '.' );
		$ip                  = substr( $ip, 0, $last_octet_position ) . '.0';
	} else {
		return false;
	}

	// Restore the IPv6 prefix to compatibility mode addresses.
	$ip = $ip_prefix . $ip;
	
	return $ip;	
}

Note that the above version is currently suited for the WP_Community_Events::get_unsafe_client_ip() method, where it returns the full IP address if e.g. the functions inet_pton() and inet_ntop() do not exists. That's intentional behavior there, see e.g. #41083.

I think we might need to adjust the full IP address return here, with something else. Also should the function return false or e.g. WP_Error instead on failure? But it would be great if this function could be re-used in e.g. the WP_Community_Events::get_unsafe_client_ip() method.

Last edited 6 years ago by birgire (previous) (diff)

#9 @allendav
6 years ago

I like what @birgire has proposed - i'd use it in #43588 :)

#10 follow-up: @postphotos
6 years ago

Thanks @jesperher for that detailed list!

I wanted to comment on this bit:

case 'date':
                        //should we ever make something to do this? - can dates be personal?

Birthdates or the day when a user joins a service, for example, could partly reveal identity as it filters a larger dataset into a smaller one.

It reminds me of the great dialogue in the #gdpr-compliance Slack channel about how user agents aren't exactly anonymous... Also, the EFF wrote a great post a few years back about what a user agent contains.

My proposal would be to, on random submit for time, just return 0 UTC/Thursday, January 1, 1970, which should indicate, universally, that this date was redacted.

#11 follow-up: @azaozz
6 years ago

This sounds good but imho there are far too many options to anonymize things. It would be handy for translatable strings but not for numbers, tokens, etc. For example: what would be an anonymized id or email or url? All these should be empty strings.

I'd also argue that an anonymized IP address is 0.0.0.0 (if not an empty string). Don't see any value to try and keep partial user data for general purposes. If a plugin requires to retain specific data, it can choose to anonymize it in a specific way.

I like the translatable text and longtext, but they seem too long? Something like Deleted and Deleted by the author. would be enough imho.

Also, two different filters, one of which is dynamic (a.k.a very hard to read/use) for few strings seem excessive :)

#12 in reply to: ↑ 10 @jesperher
6 years ago

Replying to postphotos:

case 'date':
                        //should we ever make something to do this? - can dates be personal?

Birthdates or the day when a user joins a service, for example, could partly reveal identity as it filters a larger dataset into a smaller one.

It reminds me of the great dialogue in the #gdpr-compliance Slack channel about how user agents aren't exactly anonymous... Also, the EFF wrote a great post a few years back about what a user agent contains.

My proposal would be to, on random submit for time, just return 0 UTC/Thursday, January 1, 1970, which should indicate, universally, that this date was redacted.

I think you are right on this one. I will make it so it returns 0 / January 1, 1970, nomatter what the input is.

For the user agents - i dont see other solution right now, than just to return an empty string.

#13 in reply to: ↑ 11 @jesperher
6 years ago

Replying to azaozz:

This sounds good but imho there are far too many options to anonymize things. It would be handy for translatable strings but not for numbers, tokens, etc. For example: what would be an anonymized id or email or url? All these should be empty strings.

I kind of agree on numbers - when first creating the list, it was kind of a brainstorm on different types of data.

For the ID part, i dont think we can anonymize that.
But for the email part i still think this i very relevant, because of the way WordPress uses mail adresses, and how a lot of code normally us depending on the the data to be of a valid email type.
Then i belive it is better, to create a "standard" of how it us anonymized, and help plugin developers to use this feature, istead of it beeing done on 1000 different ways.

I'd also argue that an anonymized IP address is 0.0.0.0 (if not an empty string). Don't see any value to try and keep partial user data for general purposes. If a plugin requires to retain specific data, it can choose to anonymize it in a specific way.

We could maybe create a Ip_zero, and a IP_partially or something like that, to make both methods avaviable.

I like the translatable text and longtext, but they seem too long? Something like Deleted and Deleted by the author. would be enough imho.

You might be right on this.

Also, two different filters, one of which is dynamic (a.k.a very hard to read/use) for few strings seem excessive :)

I disagree.
It is also used in that way other places in core, with dynamic hooks. Like on creating/saving posts.

#14 follow-up: @azaozz
6 years ago

We could maybe create a Ip_zero, and a IP_partially...

Actually think there was a court ruling in Germany requiring anonymized IPs to be cut to only the first two octets, i.e. 192.168.0.0. This makes them really useless for any purpose :(

Last edited 6 years ago by azaozz (previous) (diff)

#15 @azaozz
6 years ago

  • Milestone changed from Awaiting Review to 5.0

#16 @allendav
6 years ago

Instead of 0.0.0.0, it is helpful to retain the first octets of the IP address so you can still see where (geographically) your commenters are coming from without exposing their identify (which the full IP address could.)

For IPV4, let's zero the last octet and for IPV6 the last ten. If a site needs more than that, let's make the IP anonymization filterable

@azaozz
6 years ago

#17 @azaozz
6 years ago

In 43545.2.diff: make the changes discussed in https://core.trac.wordpress.org/ticket/43545#comment:11 and https://core.trac.wordpress.org/ticket/43545#comment:13. I've also removed the $data_to_anonymize param since it became unused.

If we want to be able to anonymize IP addresses by truncating them, we can extract the function from the Community Events dashboard widget https://core.trac.wordpress.org/browser/tags/4.9.4/src/wp-admin/includes/class-wp-community-events.php#L210 and reuse it. However seems that this will not comply with the privacy laws in some countries.

Also, seems that the comment text is considered "personal data", so we will have to delete the whole comments instead of "anonymizing" them. Then there will be no need for truncated IPs for comments.

Last edited 6 years ago by azaozz (previous) (diff)

#18 in reply to: ↑ 4 @idea15
6 years ago

NB there is no "GDPR right to anonymisation" (I think you may be conflating two different things there.) Anonymisation is highly recommended under GDPR but it's not one of the fundamental user rights - it's one of the means which could make it possible.

Capital P dangit

  • Anonymizes data for the GDPR "Right to Anonymization".
Version 0, edited 6 years ago by idea15 (next)

#19 in reply to: ↑ 14 ; follow-up: @pputzer
6 years ago

Replying to azaozz:

We could maybe create a Ip_zero, and a IP_partially...

Actually think there was a court ruling in Germany requiring anonymized IPs to be cut to only the first two octets, i.e. 192.168.0.0. This makes them really useless for any purpose :(

Have you got any source for this? I know of no such ruling and have not been able to find any German article that mentions it. All German articles on the topic of IP anonymization seem talk about killing the last octet (or using Google's standard _anonymizeIp() function).

#20 in reply to: ↑ 19 ; follow-up: @azaozz
6 years ago

Replying to pputzer:

Have you got any source for this? I know of no such ruling and have not been able to find any German article that mentions it. All German articles on the topic of IP anonymization seem talk about killing the last octet (or using Google's standard _anonymizeIp() function).

Somebody mentioned this few weeks ago in Slack (I think) but can't find anything more about it either. Thinking we can consider this as "not real unless proven otherwise" :)

#21 @allendav
6 years ago

Instead of example.test, could we use example.com or example.org per https://www.iana.org/domains/reserved

@allendav
6 years ago

Adds IPV4 and IPV6 anonymization functions and unit tests for everything

#22 follow-up: @allendav
6 years ago

@azaozz - I added IPV4 and IPV6 anonymization and unit tests for the lot. I also modified the domains to example.com and the time stamp to a more appropriate default

To run the unit tests, run

phpunit tests/phpunit/tests/functions/anonymization.php

Please let me know if adding a unit test file will be a no-no for a minor release and i can add these somewhere else

This ticket was mentioned in Slack in #gdpr-compliance by allendav. View the logs.


6 years ago

#24 in reply to: ↑ 22 @azaozz
6 years ago

Replying to allendav:

43545.diff looks good. There is one place where it may fail to anonymize an IPv6 address, if ! function_exists( 'inet_pton' ). I'll fix that.

Was also thinking we can reuse the "extracted" function in WP_Community_Events: https://core.trac.wordpress.org/browser/tags/4.9.4/src/wp-admin/includes/class-wp-community-events.php#L210

...modified the domains to example.com and the time stamp to a more appropriate default

Not sure if that's better. Thinking that it should be changed to something that can never be used. The example.com is an actual website, it belongs to "somebody". However the .test domain will never ever exist.

@azaozz
6 years ago

#25 @azaozz
6 years ago

In 43545.3.diff:

  • Fix output from wp_privacy_anonymize_ip() when inet_pton() or inet_ntop() don't exist.
  • Make wp_privacy_anonymize_data() return an empty string when $type is not passed.
  • Change the anonymous site address to site.invalid. That way nobody ever can mistake it for a "real site" as it will always return an error when trying to reach it.
Last edited 6 years ago by azaozz (previous) (diff)

#26 follow-up: @birgire
6 years ago

I was wondering about the current 0.0.0.0 output for an IPv6 address.

For example:

wp_privacy_anonymize_ip( '2001:db8:a0b:12f0::1' ) now outputs '0.0.0.0' when inet_ntop() or inet_pton() are not available.

The unspecified address for IPv6 is 0:0:0:0:0:0:0:0 or ::.

Would :: be expected in the above example?

Some other suggestions for 43545.3.diff:

Make the tests aware of missing inet_ntop()/inet_pton() functions, with e.g.

if ( ! function_exists( 'inet_ntop' ) ) {
	$this->markTestSkipped( 'inet_ntop extension not available.' );
}
if ( ! function_exists( 'inet_pton' ) ) {
	$this->markTestSkipped( 'inet_pton extension not available.' );
}

or modify the expected value explicitly like:

if( function_exists( 'inet_ntop' ) && function_exists( 'inet_pton' ) ) {
    $expected = '...';
} else {
    $expected = '0.0.0.0';
}

Adjustments according to the Coding Standard.

Should we use data providers here instead of for loops?

Last edited 6 years ago by birgire (previous) (diff)

#27 in reply to: ↑ 26 @azaozz
6 years ago

Replying to birgire:

I was wondering about the current 0.0.0.0 output for an IPv6 address.

Yeah, can change that to :: although (afaik) IPv4 and IPv6 addresses are fully interchangeable.

This ticket was mentioned in Slack in #gdpr-compliance by allendav. View the logs.


6 years ago

#29 follow-up: @iandunn
6 years ago

FYI, there was a recent bug fix to the Events Widget IP anonymization in r42968, so that should probably be synced here.

Although, rather than duplicating that code, would it be better to abstract it into a modular function that can be used by both the GDPR anonymization and the Events Widget anonymization?

@azaozz
6 years ago

#30 in reply to: ↑ 29 @azaozz
6 years ago

Replying to iandunn:

FYI, there was a recent bug fix to the Events Widget IP anonymization in r42968, so that should probably be synced here.

Yep, thanks for the ping! :)

Although, rather than duplicating that code, would it be better to abstract it into a modular function that can be used by both the GDPR anonymization and the Events Widget anonymization?

Yeah, was thinking that too. In 43545.4.diff:

  • Abstract the code that anonymizes IP address and use it in both the Events Widget and wp_privacy_anonymize_data().
  • Make $type required in wp_privacy_anonymize_data(). The right data type is pretty important. Thinking it's better to always pass it rather than get an empty string for unspecified types. That will let us add more types in the future if needed.
  • Clean up the test a bit.

@iandunn
6 years ago

Move $ip_prefix and tests, minor cleanup. See comment:30.

#31 @iandunn
6 years ago

43545.4.diff looks pretty good to me. I made a few minor changes in 43545.5.diff:

src:

  • Move $ip_prefix from get_unsafe_client_ip() wp_privacy_anonymize_ip() to preserve the default value (although the NULL would probably get converted to '' anyway).
  • Minor cleanup (trailing spaces, full stops, etc).

tests:

  • Moved the tests from Test_WP_Community_Events::test_get_unsafe_client_ip_anonymization() to Tests_Functions_Anonymization, so that they remain attached to the function they're testing.
  • Refactored Test_WP_Community_Events::test_get_unsafe_client_ip() to only test the functionality left in get_unsafe_client_ip().
  • Swapped function params in unit test calls to wp_privacy_anonymize_ip(), so that the tests pass.

It might be worth noting that the Events Widget doesn't strictly need $ipv6_fallback mode, if using the un-anonymized IP is going to cause GDPR issues. If only anonymized IPs were sent to api.wordpress.org, then a small subset of users -- those running PHP < 5.3 on Windows -- would lose the functionality that automatically displays events based on geolocating the IP address. Instead, they'd have to manually enter their location.

Automatically detecting the location and showing local events is one of key features of the Events Widget, because the demographic it primarily targets are those users who aren't already aware of the WP community, and are therefore less likely to put in the effort to explore the dashboard widget and manually fetch the events. Having said that, though, removing that functionality could be an acceptable compromise.

If we do keep $ipv6_fallback mode, then we should probably add some tests for it.

#32 @iandunn
6 years ago

If this gets bumped from `5.0` to `4.9.x`, then it'd probably be best to move the tests from tests/functions/anonymization.php to tests/functions.php, to avoid adding a new file in a point release.

#33 @azaozz
6 years ago

@iandunn thanks for the review and the fixes! 43545.5.diff looks good. Going to commit it later today if no other remarks.

...the Events Widget doesn't strictly need $ipv6_fallback mode, if using the un-anonymized IP is going to cause GDPR issues.

Was thinking about that too. For privacy we have to guarantee we return an anonymized IP, no matter what. Since we depend on the PHP version/environment to be able to do anonymization of IPv6 addresses, it's probably a good idea to be able to fall back gracefully. However removing that will make wp_privacy_anonymize_ip() simpler to use... Kind of 50/50 there :)

If this gets bumped from 5.0 to 4.9.x...

No problems with new files in /tests. They are only problematic in /build for minor releases.

@birgire
6 years ago

#34 @birgire
6 years ago

I was testing various IP address inputs, when I noticed that for e.g.:

$ip_addr = ':::';
echo inet_pton( $ip_addr );

gives a PHP warning:

Warning:  inet_pton(): Unrecognized address ::: in [...][...] on line 3 

We could handle it with e.g.:

try {
	$ip_addr = inet_ntop( inet_pton( $ip_addr ) & inet_pton( $netmask ) );
	if( false === $ip_addr ) {
		return '::';
	}
} catch( Exception $error ) {
	return '::';
}

where we also handle the possible false output of inet_ntop().

I checked and these warnings will be removed in PHP 7.1+

https://github.com/php/php-src/pull/3200

The 43545.6.diff patch includes suggestions that:

tests part:

  • Declares visibility and adds doc comments for
    • test_anonymize_email(),
    • test_anonymize_url(),
    • test_anonymize_date(),
    • test_anonymize_text(),
    • test_anonymize_long_text().
  • Adds misisng full-stops to inline comments.
  • Adds doc parameters for $raw_ip and $expected_result.
  • Detailed @return for data_wp_privacy_anonymize_ip().
  • Adds a file doc comment.
  • Adds short doc comment for the class.
  • Makes the tests explicit in test_anonymize_text(), test_anonymize_long_text(), i.e. use assertEquals() instead of assertNotEquals().
  • Adds the invalid IP testcase: ::: -> :: that would give a PHP warning if unhandled.
  • Adds the invalid IP testcase: null -> 0.0.0.0.
  • Adds the netmask testcase: 10.20.30.45/24 -> 10.20.30.0.
  • Adds a markTestSkipped in test_wp_privacy_anonymize_ip() if inet_pton and inet_pton are not available. Otherwise most of the IPv6 testcases, provided by data_wp_privacy_anonymize_ip(), would fail. Another approach would be to adjust the testcases to to handle that.
  • Replaces: Test that get_unsafe_client_ip() properly anonymizes all possible address formats. with: Test that wp_privacy_anonymize_ip() properly anonymizes all possible IP address formats.

src part:

  • Adds @uses inet_ntop and inet_pton if available, to handle IPv6 addresses.
  • Handle a possible PHP warning from inet_pton and inet_pton for an unrecognized address.
  • Handle a possible false output of inet_ntop().


To consider:

Example:

echo wp_privacy_anonymize_data( 'ip', '...' );
 

outputs: '...0' instead of 0.0.0.0.

Another example is how out-of-range IPv4 should be handled:

echo wp_privacy_anonymize_data( 'ip', '999.999.999.999' );
 

outputs: '999.999.999.0' but should it be instead 0.0.0.0 ?

One might aim for a stricter IPv4 validation in a future ticket or e.g. ticket #34538 Improvement of the IPv4 address format check on WP_Http::is_ip_address.

I wonder if there should be a general WP IP utility (possible future ticket) to avoid IP handling duplications in various parts of the core (e.g. db hosts, HTTP API, Rest, Multisite support of IPv6, ... ).

Last edited 6 years ago by birgire (previous) (diff)

#35 @iandunn
6 years ago

No problems with new files in /tests. They are only problematic in /build for minor releases.

Doh, good point :)


43545.6.diff has lots of good tweaks. A few minor questions:

  • Should @uses be added? It sounds like the standard discourages it now, and I personally feel like it adds a maintenance overhead without being particularly useful.
  • I don't think inet_(ntop|pton) throw any exceptions, so is there a benefit of using try/catch?
  • It looks like the changes to src/wp-admin/includes/class-wp-community-events.php and tests/phpunit/tests/admin/includesCommunityEvents.php were left out.

@birgire
6 years ago

#36 @birgire
6 years ago

Thanks @iandunn, I wasn't aware of the @uses being discouraged, I only consulted core here regarding that tag ;-)
You're absolutely right about the try/catch not working for PHP warnings, it was the first thing that I tried and it somehow worked well with phpunit ;-)

43545.7.diff should include all the four files and it removes the @uses, try/catch and the unrecognized address testcase.

#37 @azaozz
6 years ago

  • Owner set to azaozz
  • Resolution set to fixed
  • Status changed from new to closed

In 42971:

Privacy: add helper function for anonymizing data in a standardized way.

Props jesperher, allendav, iandunn, birgire, azaozz.
Fixes #43545.

#38 @azaozz
6 years ago

  • Milestone changed from 5.0 to 4.9.6
  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopen for "next minor".

This ticket was mentioned in Slack in #gdpr-compliance by azaozz. View the logs.


6 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


6 years ago

@joemcgill
6 years ago

#41 @joemcgill
6 years ago

43545.8.diff Updates version numbers in the docblocks before backporting. This needs to be backported before r42994, since that commit depends on functionality from this ticket.

This ticket was mentioned in Slack in #gdpr-compliance by desrosj. View the logs.


6 years ago

This ticket was mentioned in Slack in #gdpr-compliance by allendav. View the logs.


6 years ago

#44 @SergeyBiryukov
6 years ago

In 43081:

Docs: Update @since version numbers for wp_privacy_anonymize_ip() and wp_privacy_anonymize_data().

Props joemcgill.
See #43545.

#45 @SergeyBiryukov
6 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 43082:

Privacy: add helper function for anonymizing data in a standardized way.

Props jesperher, allendav, iandunn, birgire, azaozz, joemcgill.
Merges [42971] and [43081] to the 4.9 branch.
Fixes #43545.

#46 @desrosj
6 years ago

  • Component changed from Options, Meta APIs to Privacy

Moving to the new Privacy component.

#47 in reply to: ↑ 20 ; follow-up: @sgreger
6 years ago

Replying to azaozz:

Replying to pputzer:

Have you got any source for this? I know of no such ruling and have not been able to find any German article that mentions it. All German articles on the topic of IP anonymization seem talk about killing the last octet (or using Google's standard _anonymizeIp() function).

Somebody mentioned this few weeks ago in Slack (I think) but can't find anything more about it either. Thinking we can consider this as "not real unless proven otherwise" :)

I am not aware of any court rulings, but the consideration about a potential requirement of cutting to only the first two octets (IPv4) may stem from the EU Article 29 Working Group's Opinion paper WP148 (2008):

"Currently, some search engine providers truncate IPv4 addresses by removing the final octet, thus in effect retaining information about the user's ISP or subnet, but not directly
identifying the individual. The activity could then originate from any of 254 IP addresses. This may not always be enough to guarantee anonymisation."
(emphasis mine)

The "Unabhängiges Landeszentrum für Datenschutz Schleswig-Holstein", one of Germany's 16 Federal Data Protection Authorities references that text in their FAQ on IP addresses (in German) to suggest obfuscation of two octets as acceptable.

This is the opinion of public advisory bodies on data protection, not binding law; deleting the last octet appears to indeed be common practice in Germany, though some sources (legal blogs, mainly) tend to recommend deleting two. Since privacy compliance is always about minimising risk, not absolute rules, I believe that at least a filter to set a higher anonymization level could be worth considering?

#48 in reply to: ↑ 47 @pputzer
6 years ago

Replying to sgreger:

This is the opinion of public advisory bodies on data protection, not binding law; deleting the last octet appears to indeed be common practice in Germany, though some sources (legal blogs, mainly) tend to recommend deleting two. Since privacy compliance is always about minimising risk, not absolute rules, I believe that at least a filter to set a higher anonymization level could be worth considering?

Thanks, now we at least know where that is coming from. While I still think that removing the last octet is fine, having a filter hook seems like a good idea.

Note: See TracTickets for help on using tickets.