Fix and augment check-for-inclusive-language CI check (#29549)
* Fix and augment `check-for-inclusive-language` CI check

Related: #15994 #23090

This PR addresses a few items related to inclusive language use and the CI check:

- There are several occurrences of "dummy" throughout the documentation; however, the current CI check for preventing non-inclusive language doesn't inspect `docs/` files. Ideally, the docs should also use inclusive language.
- Even after removing `docs/` from the exclusion list, non-inclusive language was still escaping pygrep. Upon inspection, the `(?x)` inline modifier was missing from the regex (although it was intended in #23090). Adding this modifier revealed these "dummy" instances and other related non-inclusive occurrences that were previously uncaught (see the sketch after this list).
- The exclusion list seemed too broad in places. There are still instances in which directories are excluded as a whole, but the list is now more tailored to non-inclusive occurrences that are beyond Airflow's purview, historical content, dev/test files, etc.

* Update links in get_pandas_df() of BigQueryHook

* dummy-command -> placeholder-command in _KubernetesDecoratedOperator
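
As an aside on the `(?ix)` fix: pygrep patterns are evaluated with Python's `re` module, and without the verbose flag the literal newlines and indentation of a multi-line pattern become part of every alternative, so the check silently matches almost nothing. A minimal sketch, using a shortened stand-in word list rather than the real entry:

```python
import re

# Shortened stand-in for the pre-commit "entry" pattern (not the full list).
pattern_without_x = r"""(?i)
    (black|white)[_-]?list|
    \bdummy\b
"""
pattern_with_x = r"""(?ix)
    (black|white)[_-]?list|
    \bdummy\b
"""

line = "uses a dummy value for the whitelist"

# Without (?x), each alternative still contains the literal newline and
# indentation, so nothing matches; with (?x), pattern whitespace is ignored.
print(bool(re.search(pattern_without_x, line)))  # False
print(bool(re.search(pattern_with_x, line)))     # True
```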
josh-fell committed Feb 22, 2023
1 parent 47ebe99 commit dba390e
Showing 24 changed files with 76 additions and 48 deletions.
62 changes: 45 additions & 17 deletions .pre-commit-config.yaml
@@ -438,7 +438,7 @@ repos:
name: Check for language that we do not accept as community
description: Please use more appropriate words for community documentation.
entry: >
(?i)
(?ix)
(black|white)[_-]?list|
\bshe\b|
\bhe\b|
@@ -451,25 +451,53 @@
pass_filenames: true
exclude: >
(?x)
^airflow/api_connexion/openapi/v1\.yaml$|
^airflow/cli/commands/webserver_command\.py$|
^airflow/config_templates/config\.yml$|
^airflow/config_templates/default_airflow\.cfg$|
^airflow/providers/|
^airflow/www/fab_security/manager\.py$|
^airflow/www/static/|
^docs/.*$|
^docs/apache-airflow-providers-apache-cassandra/connections/cassandra\.rst$|
^docs/apache-airflow-providers-apache-hive/commits\.rst$|
^tests/cli/commands/test_internal_api_command\.py$|
^tests/cli/commands/test_webserver_command\.py$|
^tests/integration/providers/apache/cassandra/hooks/test_cassandra\.py$|
^tests/providers/|
^tests/providers/apache/cassandra/hooks/test_cassandra\.py$|
^tests/system/providers/apache/spark/example_spark_dag\.py$|
^airflow/api_connexion/openapi/v1.yaml$|
^airflow/cli/commands/webserver_command.py$|
^airflow/config_templates/|
^airflow/models/baseoperator.py$|
^airflow/operators/__init__.py$|
^airflow/providers/amazon/aws/hooks/emr.py$|
^airflow/providers/amazon/aws/operators/emr.py$|
^airflow/providers/apache/cassandra/hooks/cassandra.py$|
^airflow/providers/apache/hive/operators/hive_stats.py$|
^airflow/providers/apache/hive/transfers/vertica_to_hive.py$|
^airflow/providers/apache/spark/hooks/|
^airflow/providers/apache/spark/operators/|
^airflow/providers/exasol/hooks/exasol.py$|
^airflow/providers/google/cloud/hooks/bigquery.py$|
^airflow/providers/google/cloud/operators/cloud_build.py$|
^airflow/providers/google/cloud/operators/dataproc.py$|
^airflow/providers/google/cloud/operators/mlengine.py$|
^airflow/providers/microsoft/azure/hooks/cosmos.py$|
^airflow/providers/microsoft/winrm/hooks/winrm.py$|
^airflow/www/fab_security/manager.py$|
^docs/.*commits.rst$|
^docs/apache-airflow/administration-and-deployment/security/webserver.rst$|
^docs/apache-airflow-providers-apache-cassandra/connections/cassandra.rst$|
^airflow/providers/microsoft/winrm/operators/winrm.py$|
^airflow/providers/opsgenie/hooks/opsgenie.py$|
^airflow/providers/redis/provider.yaml$|
^airflow/serialization/serialized_objects.py$|
^airflow/utils/db.py$|
^airflow/utils/trigger_rule.py$|
^airflow/www/static/css/bootstrap-theme.css$|
^airflow/www/static/js/types/api-generated.ts$|
^airflow/www/templates/appbuilder/flash.html$|
^dev/|
^docs/README.rst$|
^docs/apache-airflow-providers-amazon/secrets-backends/aws-ssm-parameter-store.rst$|
^docs/apache-airflow-providers-apache-hdfs/connections.rst$|
^docs/apache-airflow-providers-google/operators/cloud/kubernetes_engine.rst$|
^docs/apache-airflow-providers-microsoft-azure/connections/azure_cosmos.rst$|
^docs/conf.py$|
^docs/exts/removemarktransform.py$|
^scripts/ci/pre_commit/pre_commit_vendor_k8s_json_schema.py$|
^tests/|
^.pre-commit-config\.yaml$|
^.*CHANGELOG\.(rst|txt)$|
^.*RELEASE_NOTES\.rst$|
^CONTRIBUTORS_QUICK_START.rst$|
^.*\.(png|gif|jp[e]?g|tgz|lock)$|
git
- id: check-base-operator-partial-arguments
name: Check BaseOperator and partial() arguments
8 changes: 4 additions & 4 deletions airflow/api_connexion/endpoints/connection_endpoint.py
@@ -178,16 +178,16 @@ def test_connection() -> APIResponse:
"""
Test an API connection.
This method first creates an in-memory dummy conn_id & exports that to an
This method first creates an in-memory transient conn_id & exports that to an
env var, as some hook classes try to find the connection from their __init__ method and error out
if it is not found. It also deletes the conn id env variable after the test.
"""
body = request.json
dummy_conn_id = get_random_string()
conn_env_var = f"{CONN_ENV_PREFIX}{dummy_conn_id.upper()}"
transient_conn_id = get_random_string()
conn_env_var = f"{CONN_ENV_PREFIX}{transient_conn_id.upper()}"
try:
data = connection_schema.load(body)
data["conn_id"] = dummy_conn_id
data["conn_id"] = transient_conn_id
conn = Connection(**data)
os.environ[conn_env_var] = conn.get_uri()
status, message = conn.test_connection()
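
A rough sketch of the environment-variable connection pattern this endpoint relies on (the connection id and host below are made up): Airflow resolves connections from variables named `AIRFLOW_CONN_<CONN_ID>`, so exporting a URI under a random id makes the connection discoverable by hooks during the test and easy to clean up afterwards:

```python
import os
from airflow.models.connection import Connection

# Hypothetical connection used only for illustration.
conn = Connection(conn_id="transient_test_conn", conn_type="http", host="example.com")

env_var = f"AIRFLOW_CONN_{conn.conn_id.upper()}"
os.environ[env_var] = conn.get_uri()  # make the conn discoverable by hooks
try:
    status, message = conn.test_connection()
finally:
    # Always remove the temporary environment variable.
    os.environ.pop(env_var, None)
```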
@@ -22,7 +22,7 @@
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
spec:
containers:
- env:
@@ -22,7 +22,7 @@
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
spec:
containers:
- env:
6 changes: 3 additions & 3 deletions airflow/kubernetes_executor_templates/basic_template.yaml
@@ -18,14 +18,14 @@
kind: Pod
apiVersion: v1
metadata:
name: dummy-name-dont-delete
namespace: dummy-name-dont-delete
name: placeholder-name-dont-delete
namespace: placeholder-name-dont-delete
labels:
mylabel: foo
spec:
containers:
- name: base
image: dummy-name-dont-delete
image: placeholder-name-dont-delete
env:
- name: AIRFLOW__CORE__EXECUTOR
value: LocalExecutor
2 changes: 1 addition & 1 deletion airflow/providers/cncf/kubernetes/decorators/kubernetes.py
@@ -73,7 +73,7 @@ def __init__(self, namespace: str = "default", use_dill: bool = False, **kwargs)
super().__init__(
namespace=namespace,
name=kwargs.pop("name", f"k8s_airflow_pod_{uuid.uuid4().hex}"),
cmds=["dummy-command"],
cmds=["placeholder-command"],
**kwargs,
)

2 changes: 1 addition & 1 deletion airflow/providers/docker/decorators/docker.py
@@ -90,7 +90,7 @@ def __init__(
expect_airflow: bool = True,
**kwargs,
) -> None:
command = "dummy command"
command = "placeholder command"
self.python_command = python_command
self.expect_airflow = expect_airflow
self.pickling_library = dill if use_dill else pickle
4 changes: 2 additions & 2 deletions airflow/providers/google/cloud/hooks/bigquery.py
@@ -240,8 +240,8 @@ def get_pandas_df(
query. The DbApiHook method must be overridden because Pandas
doesn't support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447
https://github.com/pydata/pandas/issues/6900
https://github.com/pandas-dev/pandas/blob/055d008615272a1ceca9720dc365a2abd316f353/pandas/io/sql.py#L415
https://github.com/pandas-dev/pandas/issues/6900
:param sql: The BigQuery SQL to execute.
:param parameters: The parameters to render the SQL query with (not
2 changes: 1 addition & 1 deletion airflow/providers/google/provider.yaml
@@ -434,7 +434,7 @@ integrations:
- /docs/apache-airflow-providers-google/operators/cloud/workflows.rst
tags: [gcp]
- integration-name: Google LevelDB
external-doc-url: https://github.com/google/leveldb/blob/master/doc/index.md
external-doc-url: https://github.com/google/leveldb/blob/main/doc/index.md
how-to-guide:
- /docs/apache-airflow-providers-google/operators/leveldb/leveldb.rst
tags: [google]
4 changes: 2 additions & 2 deletions airflow/utils/dag_edges.py
@@ -26,9 +26,9 @@ def dag_edges(dag: DAG):
Create the list of edges needed to construct the Graph view.
A special case is made if a TaskGroup is immediately upstream/downstream of another
TaskGroup or task. Two dummy nodes named upstream_join_id and downstream_join_id are
TaskGroup or task. Two proxy nodes named upstream_join_id and downstream_join_id are
created for the TaskGroup. Instead of drawing an edge onto every task in the TaskGroup,
all edges are directed onto the dummy nodes. This is to cut down the number of edges on
all edges are directed onto the proxy nodes. This is to cut down the number of edges on
the graph.
For example: A DAG with TaskGroups group1 and group2:
4 changes: 2 additions & 2 deletions airflow/utils/task_group.py
@@ -375,7 +375,7 @@ def child_id(self, label):
@property
def upstream_join_id(self) -> str:
"""
If this TaskGroup has immediate upstream TaskGroups or tasks, a dummy node called
If this TaskGroup has immediate upstream TaskGroups or tasks, a proxy node called
upstream_join_id will be created in Graph view to join the outgoing edges from this
TaskGroup to reduce the total number of edges needed to be displayed.
"""
@@ -384,7 +384,7 @@ def upstream_join_id(self) -> str:
@property
def downstream_join_id(self) -> str:
"""
If this TaskGroup has immediate downstream TaskGroups or tasks, a dummy node called
If this TaskGroup has immediate downstream TaskGroups or tasks, a proxy node called
downstream_join_id will be created in Graph view to join the outgoing edges from this
TaskGroup to reduce the total number of edges needed to be displayed.
"""
2 changes: 1 addition & 1 deletion chart/files/pod-template-file.kubernetes-helm-yaml
@@ -23,7 +23,7 @@
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
labels:
tier: airflow
component: worker
2 changes: 1 addition & 1 deletion chart/values.schema.json
@@ -5242,7 +5242,7 @@
"x-docsSection": "Airflow",
"default": null,
"examples": [
"apiVersion: v1\nkind: Pod\nmetadata:\n name: dummy-name\n labels:\n tier: airflow\n component: worker\n release: {{ .Release.Name }}\nspec:\n priorityClassName: high-priority\n containers:\n - name: base\n ..."
"apiVersion: v1\nkind: Pod\nmetadata:\n name: placeholder-name\n labels:\n tier: airflow\n component: worker\n release: {{ .Release.Name }}\nspec:\n priorityClassName: high-priority\n containers:\n - name: base\n ..."
]
},
"dags": {
2 changes: 1 addition & 1 deletion chart/values.yaml
@@ -1840,7 +1840,7 @@ podTemplate: ~
# apiVersion: v1
# kind: Pod
# metadata:
# name: dummy-name
# name: placeholder-name
# labels:
# tier: airflow
# component: worker
@@ -48,5 +48,5 @@ Reference

For further information, look at:

* `Product Documentation <https://github.com/google/leveldb/blob/master/doc/index.md>`__
* `Product Documentation <https://github.com/google/leveldb/blob/main/doc/index.md>`__
* `Client Library Documentation <https://plyvel.readthedocs.io/en/latest/>`__
2 changes: 1 addition & 1 deletion docs/apache-airflow/authoring-and-scheduling/plugins.rst
@@ -161,7 +161,7 @@ Make sure you restart the webserver and scheduler after making changes to plugin
Example
-------

The code below defines a plugin that injects a set of dummy object
The code below defines a plugin that injects a set of illustrative object
definitions in Airflow.

.. code-block:: python
2 changes: 1 addition & 1 deletion docs/apache-airflow/authoring-and-scheduling/timezone.rst
@@ -96,7 +96,7 @@ words if you have a default time zone setting of ``Europe/Amsterdam`` and create
start_date=pendulum.datetime(2017, 1, 1, tz="UTC"),
default_args={"retries": 3},
)
op = BashOperator(task_id="dummy", bash_command="Hello World!", dag=dag)
op = BashOperator(task_id="hello_world", bash_command="Hello World!", dag=dag)
print(op.retries) # 3
Unfortunately, during DST transitions, some datetimes don't exist or are ambiguous.
2 changes: 1 addition & 1 deletion docs/apache-airflow/best-practices.rst
@@ -690,7 +690,7 @@ A *better* way (though it's a bit more manual) is to use the ``dags pause`` comm
Add "integration test" DAGs
---------------------------

It can be helpful to add a couple "integration test" DAGs that use all the common services in your ecosystem (e.g. S3, Snowflake, Vault) but with dummy resources or "dev" accounts. These test DAGs can be the ones you turn on *first* after an upgrade, because if they fail, it doesn't matter and you can revert to your backup without negative consequences. However, if they succeed, they should prove that your cluster is able to run tasks with the libraries and services that you need to use.
It can be helpful to add a couple "integration test" DAGs that use all the common services in your ecosystem (e.g. S3, Snowflake, Vault) but with test resources or "dev" accounts. These test DAGs can be the ones you turn on *first* after an upgrade, because if they fail, it doesn't matter and you can revert to your backup without negative consequences. However, if they succeed, they should prove that your cluster is able to run tasks with the libraries and services that you need to use.

For example, if you use an external secrets backend, make sure you have a task that retrieves a connection. If you use KubernetesPodOperator, add a task that runs ``sleep 30; echo "hello"``. If you need to write to s3, do so in a test task. And if you need to access a database, add a task that does ``select 1`` from the server.
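
A minimal sketch of such an integration-test DAG, assuming a hypothetical dev Postgres connection id (every name below is illustrative):

```python
import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="integration_smoke_test",  # hypothetical DAG id
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    tags=["integration-test"],
):
    # Cheap task that only proves the workers can run commands.
    smoke = BashOperator(task_id="smoke", bash_command='sleep 30; echo "hello"')

    # Proves the database connection and provider packages still work.
    db_check = PostgresOperator(
        task_id="db_check",
        postgres_conn_id="my_dev_postgres",  # assumed dev connection id
        sql="SELECT 1",
    )

    smoke >> db_check
```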

2 changes: 1 addition & 1 deletion docs/apache-airflow/core-concepts/dags.rst
@@ -258,7 +258,7 @@ Often, many Operators inside a DAG need the same set of default arguments (such
schedule="@daily",
default_args={"retries": 2},
):
op = BashOperator(task_id="dummy", bash_command="Hello World!")
op = BashOperator(task_id="hello_world", bash_command="Hello World!")
print(op.retries) # 2
2 changes: 1 addition & 1 deletion docs/apache-airflow/core-concepts/params.rst
@@ -139,7 +139,7 @@ JSON Schema Validation
# a required param which can be of multiple types
# a param must have a default value
"dummy": Param(5, type=["null", "number", "string"]),
"multi_type_param": Param(5, type=["null", "number", "string"]),
# an enum param, must be one of three values
"enum_param": Param("foo", enum=["foo", "bar", 42]),
2 changes: 1 addition & 1 deletion docs/apache-airflow/howto/connection.rst
@@ -214,7 +214,7 @@ for description on how to add custom providers.

The custom connection types are defined via Hooks delivered by the providers. The Hooks can implement
methods defined in the protocol class :class:`~airflow.hooks.base_hook.DiscoverableHook`. Note that your
custom Hook should not derive from this class, this class is a dummy example to document expectations
custom Hook should not derive from this class, this class is an example to document expectations
regarding the class fields and methods that your Hook might define. Another good example is
:py:class:`~airflow.providers.jdbc.hooks.jdbc.JdbcHook`.
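
A rough sketch of what such a Hook might look like, with hypothetical names, providing the class-level fields and methods the protocol documents without inheriting from `DiscoverableHook`:

```python
from __future__ import annotations

from typing import Any

from airflow.hooks.base import BaseHook


class MyServiceHook(BaseHook):
    """Hypothetical Hook exposing the members the DiscoverableHook protocol documents."""

    conn_name_attr = "my_service_conn_id"
    default_conn_name = "my_service_default"
    conn_type = "my_service"
    hook_name = "My Service"

    @staticmethod
    def get_ui_field_behaviour() -> dict[str, Any]:
        # Hide or relabel the standard connection form fields in the UI.
        return {
            "hidden_fields": ["schema", "extra"],
            "relabeling": {"login": "API Key"},
        }

    def get_conn(self) -> Any:
        # A real hook would build a client from the stored Connection here.
        raise NotImplementedError
```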

2 changes: 1 addition & 1 deletion docs/apache-airflow/tutorial/fundamentals.rst
@@ -393,7 +393,7 @@ which are used to populate the run schedule with task instances from this DAG.
What's Next?
-------------
That's it! You have written, tested and backfilled your very first Airflow
pipeline. Merging your code into a repository that has a master scheduler
pipeline. Merging your code into a repository that has a Scheduler
running against it should result in it being triggered and run every day.

Here are a few things you might want to do next:
2 changes: 1 addition & 1 deletion docs/helm-chart/customizing-workers.rst
@@ -57,7 +57,7 @@ As an example, let's say you want to set ``priorityClassName`` on your workers:
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
name: placeholder-name
labels:
tier: airflow
component: worker
2 changes: 1 addition & 1 deletion scripts/in_container/check_environment.sh
@@ -116,7 +116,7 @@ function startairflow_if_requested() {
. "$( dirname "${BASH_SOURCE[0]}" )/configure_environment.sh"

airflow db init
airflow users create -u admin -p admin -f Thor -l Adminstra -r Admin -e dummy@dummy.email
airflow users create -u admin -p admin -f Thor -l Adminstra -r Admin -e admin@email.domain

. "$( dirname "${BASH_SOURCE[0]}" )/run_init_script.sh"

